[jira] [Assigned] (SPARK-21072) `TreeNode.mapChildren` should only apply to the children node.
[ https://issues.apache.org/jira/browse/SPARK-21072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21072: Assignee: (was: Apache Spark) > `TreeNode.mapChildren` should only apply to the children node. > --- > > Key: SPARK-21072 > URL: https://issues.apache.org/jira/browse/SPARK-21072 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.1.1 > Reporter: coneyliu > > As the name and comments of `TreeNode.mapChildren` state, the function should be applied only to the current node's children. So the following code should check whether the arguments are actually children of the node. > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21072) `TreeNode.mapChildren` should only apply to the children node.
[ https://issues.apache.org/jira/browse/SPARK-21072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047431#comment-16047431 ] Apache Spark commented on SPARK-21072: -- User 'ConeyLiu' has created a pull request for this issue: https://github.com/apache/spark/pull/18284 > `TreeNode.mapChildren` should only apply to the children node. > --- > > Key: SPARK-21072 > URL: https://issues.apache.org/jira/browse/SPARK-21072 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.1.1 > Reporter: coneyliu > > As the name and comments of `TreeNode.mapChildren` state, the function should be applied only to the current node's children. So the following code should check whether the arguments are actually children of the node. > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]
[jira] [Assigned] (SPARK-21072) `TreeNode.mapChildren` should only apply to the children node.
[ https://issues.apache.org/jira/browse/SPARK-21072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21072: Assignee: Apache Spark > `TreeNode.mapChildren` should only apply to the children node. > --- > > Key: SPARK-21072 > URL: https://issues.apache.org/jira/browse/SPARK-21072 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.1.1 > Reporter: coneyliu > Assignee: Apache Spark > > As the name and comments of `TreeNode.mapChildren` state, the function should be applied only to the current node's children. So the following code should check whether the arguments are actually children of the node. > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]
[jira] [Updated] (SPARK-21072) `TreeNode.mapChildren` should only apply to the children node.
[ https://issues.apache.org/jira/browse/SPARK-21072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] coneyliu updated SPARK-21072: - Description: As the name and comments of `TreeNode.mapChildren` state, the function should be applied only to the current node's children. So the following code [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342] should check whether the arguments are actually children of the node.
```
case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
  val newChild1 = f(arg1.asInstanceOf[BaseType])
  val newChild2 = f(arg2.asInstanceOf[BaseType])
  if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
    changed = true
    (newChild1, newChild2)
  } else {
    tuple
  }
```
was: As the name and comments of `TreeNode.mapChildren` state, the function should be applied only to the current node's children. So the following code [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342] should check whether the arguments are actually children of the node.
```
case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
  val newChild1 = f(arg1.asInstanceOf[BaseType])
  val newChild2 = f(arg2.asInstanceOf[BaseType])
  if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
    changed = true
    (newChild1, newChild2)
  } else {
    tuple
  }
```
> `TreeNode.mapChildren` should only apply to the children node. > --- > > Key: SPARK-21072 > URL: https://issues.apache.org/jira/browse/SPARK-21072 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.1.1 > Reporter: coneyliu > > As the name and comments of `TreeNode.mapChildren` state, the function should be applied only to the current node's children. So the following code > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342] > should check whether the arguments are actually children of the node.
> ```
> case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
>   val newChild1 = f(arg1.asInstanceOf[BaseType])
>   val newChild2 = f(arg2.asInstanceOf[BaseType])
>   if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
>     changed = true
>     (newChild1, newChild2)
>   } else {
>     tuple
>   }
> ```
[jira] [Updated] (SPARK-21072) `TreeNode.mapChildren` should only apply to the children node.
[ https://issues.apache.org/jira/browse/SPARK-21072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] coneyliu updated SPARK-21072: - Description: As the name and comments of `TreeNode.mapChildren` state, the function should be applied only to the current node's children. So the following code should check whether the arguments are actually children of the node. [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342] was: As the name and comments of `TreeNode.mapChildren` state, the function should be applied only to the current node's children. So the following code [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342] should check whether the arguments are actually children of the node.
```
case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
  val newChild1 = f(arg1.asInstanceOf[BaseType])
  val newChild2 = f(arg2.asInstanceOf[BaseType])
  if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
    changed = true
    (newChild1, newChild2)
  } else {
    tuple
  }
```
> `TreeNode.mapChildren` should only apply to the children node. > --- > > Key: SPARK-21072 > URL: https://issues.apache.org/jira/browse/SPARK-21072 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.1.1 > Reporter: coneyliu > > As the name and comments of `TreeNode.mapChildren` state, the function should be applied only to the current node's children. So the following code should check whether the arguments are actually children of the node. > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]
[jira] [Updated] (SPARK-21072) `TreeNode.mapChildren` should only apply to the children node.
[ https://issues.apache.org/jira/browse/SPARK-21072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] coneyliu updated SPARK-21072: - Description: As the name and comments of `TreeNode.mapChildren` state, the function should be applied only to the current node's children. So the following code [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342] should check whether the arguments are actually children of the node.
```
case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
  val newChild1 = f(arg1.asInstanceOf[BaseType])
  val newChild2 = f(arg2.asInstanceOf[BaseType])
  if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
    changed = true
    (newChild1, newChild2)
  } else {
    tuple
  }
```
was: As the name and comments of `TreeNode.mapChildren` state, the function should be applied only to the current node's children. So the following code [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342] should check whether the arguments are actually children of the node.
```
case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
  val newChild1 = f(arg1.asInstanceOf[BaseType])
  val newChild2 = f(arg2.asInstanceOf[BaseType])
  if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
    changed = true
    (newChild1, newChild2)
  } else {
    tuple
  }
```
> `TreeNode.mapChildren` should only apply to the children node. > --- > > Key: SPARK-21072 > URL: https://issues.apache.org/jira/browse/SPARK-21072 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.1.1 > Reporter: coneyliu > > As the name and comments of `TreeNode.mapChildren` state, the function should be applied only to the current node's children. So the following code > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342] > should check whether the arguments are actually children of the node.
> ```
> case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
>   val newChild1 = f(arg1.asInstanceOf[BaseType])
>   val newChild2 = f(arg2.asInstanceOf[BaseType])
>   if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
>     changed = true
>     (newChild1, newChild2)
>   } else {
>     tuple
>   }
> ```
[jira] [Updated] (SPARK-21072) `TreeNode.mapChildren` should only apply to the children node.
[ https://issues.apache.org/jira/browse/SPARK-21072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] coneyliu updated SPARK-21072: - Description: As the name and comments of `TreeNode.mapChildren` state, the function should be applied only to the current node's children. So the following code [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342] should check whether the arguments are actually children of the node.
```
case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
  val newChild1 = f(arg1.asInstanceOf[BaseType])
  val newChild2 = f(arg2.asInstanceOf[BaseType])
  if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
    changed = true
    (newChild1, newChild2)
  } else {
    tuple
  }
```
was: As the name and comments of `TreeNode.mapChildren` state, the function should be applied only to the current node's children. So the following code[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342] should check whether the arguments are actually children of the node.
```
case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
  val newChild1 = f(arg1.asInstanceOf[BaseType])
  val newChild2 = f(arg2.asInstanceOf[BaseType])
  if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
    changed = true
    (newChild1, newChild2)
  } else {
    tuple
  }
```
> `TreeNode.mapChildren` should only apply to the children node. > --- > > Key: SPARK-21072 > URL: https://issues.apache.org/jira/browse/SPARK-21072 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.1.1 > Reporter: coneyliu > > As the name and comments of `TreeNode.mapChildren` state, the function should be applied only to the current node's children. So the following code > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342] > should check whether the arguments are actually children of the node.
> ```
> case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
>   val newChild1 = f(arg1.asInstanceOf[BaseType])
>   val newChild2 = f(arg2.asInstanceOf[BaseType])
>   if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
>     changed = true
>     (newChild1, newChild2)
>   } else {
>     tuple
>   }
> ```
[jira] [Created] (SPARK-21072) `TreeNode.mapChildren` should only apply to the children node.
coneyliu created SPARK-21072: Summary: `TreeNode.mapChildren` should only apply to the children node. Key: SPARK-21072 URL: https://issues.apache.org/jira/browse/SPARK-21072 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.1 Reporter: coneyliu As the name and comments of `TreeNode.mapChildren` state, the function should be applied only to the current node's children. So the following code [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342] should check whether the arguments are actually children of the node.
```
case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
  val newChild1 = f(arg1.asInstanceOf[BaseType])
  val newChild2 = f(arg2.asInstanceOf[BaseType])
  if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
    changed = true
    (newChild1, newChild2)
  } else {
    tuple
  }
```
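The issue above can be illustrated with a small, self-contained Python model (all names here are hypothetical; the real code is the Scala `TreeNode.mapChildren` tuple case): the tuple case applies `f` to any `TreeNode` element, but per the proposal it should first check whether the element is actually a child of the current node.

```python
# Minimal, hypothetical Python model of the mapChildren tuple case.
# `contains_child` plays the role of the `containsChild` guard proposed
# in the ticket; this is a sketch, not Spark's actual implementation.

class Node:
    def __init__(self, name, children=(), extra=None):
        self.name = name
        self.children = list(children)
        # `extra` models a TreeNode stored in the node's fields that is
        # NOT one of its children (the case the ticket worries about).
        self.extra = extra

    def contains_child(self, node):
        return node in self.children

    def map_tuple_arg(self, tup, f):
        """Apply f only to tuple elements that are real children."""
        return tuple(f(x) if self.contains_child(x) else x for x in tup)


def rename(n):
    # Stand-in for the transformation f applied during mapChildren.
    return Node(n.name + "_mapped", n.children, n.extra)

child = Node("child")
other = Node("not_a_child")
parent = Node("parent", children=[child], extra=other)

new_tup = parent.map_tuple_arg((child, other), rename)
print(new_tup[0].name)  # child_mapped
print(new_tup[1].name)  # not_a_child
```

With the guard in place, only the genuine child is transformed; the unrelated node stored in the same tuple field is left untouched, which is the behavior the issue title asks for.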
[jira] [Resolved] (SPARK-19910) `stack` should not reject NULL values due to type mismatch
[ https://issues.apache.org/jira/browse/SPARK-19910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19910. - Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.3.0 > `stack` should not reject NULL values due to type mismatch > -- > > Key: SPARK-19910 > URL: https://issues.apache.org/jira/browse/SPARK-19910 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 > Reporter: Dongjoon Hyun > Assignee: Dongjoon Hyun > Fix For: 2.3.0 > > > Since the `stack` function generates a table with nullable columns, it should allow mixed null values.
> {code}
> scala> sql("select stack(3, 1, 2, 3)").printSchema
> root
>  |-- col0: integer (nullable = true)
> scala> sql("select stack(3, 1, 2, null)").printSchema
> org.apache.spark.sql.AnalysisException: cannot resolve 'stack(3, 1, 2, NULL)' due to data type mismatch: Argument 1 (IntegerType) != Argument 3 (NullType); line 1 pos 7;
> 'Project [unresolvedalias(stack(3, 1, 2, null), None)]
> +- OneRowRelation$
> {code}
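The rejected query can be modeled outside Spark. `stack(n, v1, ..., vk)` splits the values into n rows and ceil(k/n) columns, and the analyzer requires each output column to have one consistent type; the fix is for that check to treat NULL as compatible with any type, since the generated columns are nullable anyway. Below is a hypothetical Python sketch of such a relaxed per-column check, not Spark's actual Scala implementation:

```python
# Hypothetical model of stack's per-column type-consistency check.
# Names and the return format are illustrative only.

NULL = "NullType"

def type_of(v):
    # Toy type mapping for the example values used below.
    if v is None:
        return NULL
    return {int: "IntegerType", str: "StringType"}[type(v)]

def check_stack_types(n, *args):
    """Group value arguments into n rows and verify each output column has
    one consistent type, treating NULL as compatible with anything."""
    num_cols = -(-len(args) // n)  # ceil(len(args) / n): columns per row
    for col in range(num_cols):
        col_types = {type_of(a) for a in args[col::num_cols]}
        col_types.discard(NULL)  # the fix: NULLs never cause a mismatch
        if len(col_types) > 1:
            return "type mismatch in column %d: %s" % (col, sorted(col_types))
    return "ok"

# stack(3, 1, 2, null): one column of ints plus a NULL -> now allowed
print(check_stack_types(3, 1, 2, None))  # ok
# a genuine mismatch (int vs string in one column) is still rejected
print(check_stack_types(2, 1, "a"))
```

Discarding `NullType` before comparing mirrors the resolution described in the ticket: only conflicts between two concrete types remain errors.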
[jira] [Updated] (SPARK-21045) Spark executor blocked instead of throwing exception because exception occur when python worker send exception info to Java Gateway in Python 2+
[ https://issues.apache.org/jira/browse/SPARK-21045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joshuawangzj updated SPARK-21045: - Environment: The problem occurs only in Python 2.x; Python 3.x is OK. Summary: Spark executor blocked instead of throwing exception because exception occur when python worker send exception info to Java Gateway in Python 2+ (was: Spark executor blocked instead of throwing exception because exception occur when python worker send exception info to Java Gateway) > Spark executor blocked instead of throwing exception because exception occur > when python worker send exception info to Java Gateway in Python 2+ > > > Key: SPARK-21045 > URL: https://issues.apache.org/jira/browse/SPARK-21045 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 2.0.1, 2.0.2, 2.1.1 > Environment: The problem occurs only in Python 2.x; Python 3.x is OK. > Reporter: Joshuawangzj > > My PySpark program always blocks in our production YARN cluster. I ran jstack and found:
> {code}
> "Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
>    java.lang.Thread.State: RUNNABLE
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>         at java.net.SocketInputStream.read(SocketInputStream.java:170)
>         at java.net.SocketInputStream.read(SocketInputStream.java:141)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>         - locked <0x0007acab1c98> (a java.io.BufferedInputStream)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
>         at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
>         at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
>         at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>         at org.apache.spark.scheduler.Task.run(Task.scala:99)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> It is blocked in a socket read. I viewed the log on the blocked executor and found this error:
> {code}
> Traceback (most recent call last):
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 178, in main
>     write_with_length(traceback.format_exc().encode("utf-8"), outfile)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 618: ordinal not in range(128)
> {code}
> Finally I found the problem:
> {code:title=worker.py|borderStyle=solid}
> # line 178 in Spark 2.1.1
> except Exception:
>     try:
>         write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
>         write_with_length(traceback.format_exc().encode("utf-8"), outfile)
>     except IOError:
>         # JVM close the socket
>         pass
>     except Exception:
>         # Write the error to stderr if it happened while serializing
>         print("PySpark worker failed with exception:", file=sys.stderr)
>         print(traceback.format_exc(), file=sys.stderr)
> {code}
> When write_with_length(traceback.format_exc().encode("utf-8"), outfile) raises an exception such as UnicodeDecodeError, the Python worker cannot send the traceback. But PythonRDD has already received PYTHON_EXCEPTION_THROWN, so it expects to read the traceback length next, and it blocks.
> {code:title=PythonRDD.scala|borderStyle=solid}
> // line 190 in Spark 2.1.1
> case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
>   // Signals that an exception has been thrown in python
>   val exLength = stream.readInt() // It is possible to be blocked here
> {code}
> {color:red}
> We can trigger the bug with a simple program:
> {color}
> {code:title=test.py|borderStyle=solid}
> spark = SparkSession.builder.master('local').getOrCreate()
> rdd = spark.sparkContext.parallelize(['中']).map(lambda x: x.encode("utf8"))
> rdd.collect()
> {code}
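The root cause is reproducible without Spark: in Python 2, calling `.encode("utf-8")` on a byte string first performs an implicit `decode("ascii")`, which raises UnicodeDecodeError as soon as the traceback contains non-ASCII bytes. The snippet below demonstrates that implicit step explicitly (it runs on Python 3 as well); the byte string is a made-up stand-in for the worker's traceback text:

```python
# The failure mode behind the hang: a traceback containing non-ASCII
# UTF-8 bytes cannot be round-tripped through ASCII. In Python 2,
# tb_bytes.encode("utf-8") implicitly runs tb_bytes.decode("ascii") first.
tb_bytes = b"UnicodeDecodeError near '\xe4\xb8\xad'"  # contains UTF-8 for 中

try:
    tb_bytes.decode("ascii")  # the implicit step Python 2 performs
    failed = False
except UnicodeDecodeError:
    failed = True

print("implicit ascii decode failed:", failed)  # True

# Decoding with the correct codec succeeds, which is why the fix is to
# avoid the implicit ASCII round trip when writing the traceback.
print(tb_bytes.decode("utf-8"))
```

Once the decode raises inside the `except` handler, no traceback length is ever written to the socket, and the JVM side stays blocked in `stream.readInt()` exactly as the jstack output shows.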
[jira] [Commented] (SPARK-20988) Convert logistic regression to new aggregator framework
[ https://issues.apache.org/jira/browse/SPARK-20988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047370#comment-16047370 ] Vincent commented on SPARK-20988: - I can work on this if no one is working on it now :) > Convert logistic regression to new aggregator framework > --- > > Key: SPARK-20988 > URL: https://issues.apache.org/jira/browse/SPARK-20988 > Project: Spark > Issue Type: Sub-task > Components: ML > Affects Versions: 2.3.0 > Reporter: Seth Hendrickson > Priority: Minor > > Use the hierarchy from SPARK-19762 for logistic regression optimization
[jira] [Commented] (SPARK-21068) SparkR error message when passed an R object rather than Java object could be more informative
[ https://issues.apache.org/jira/browse/SPARK-21068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047338#comment-16047338 ] Felix Cheung commented on SPARK-21068: -- surely. what brings you to R land? :) > SparkR error message when passed an R object rather than Java object could be > more informative > -- > > Key: SPARK-21068 > URL: https://issues.apache.org/jira/browse/SPARK-21068 > Project: Spark > Issue Type: Improvement > Components: SparkR > Affects Versions: 2.2.0, 2.3.0 > Reporter: holdenk > Priority: Trivial > > When SparkR is passed a non-Java object where a Java object is expected, SparkR's backend code produces the error message `Error: clas(objId) == "jobj" is not TRUE`. > See backend.R and the isInstanceOf function
[jira] [Commented] (SPARK-21028) Parallel One vs. Rest Classifier Scala
[ https://issues.apache.org/jira/browse/SPARK-21028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047258#comment-16047258 ] Apache Spark commented on SPARK-21028: -- User 'ajaysaini725' has created a pull request for this issue: https://github.com/apache/spark/pull/18281 > Parallel One vs. Rest Classifier Scala > -- > > Key: SPARK-21028 > URL: https://issues.apache.org/jira/browse/SPARK-21028 > Project: Spark > Issue Type: New Feature > Components: ML > Affects Versions: 2.2.0, 2.2.1 > Reporter: Ajay Saini > > Right now the Scala implementation of the one vs. rest algorithm allows for > parallelism but does not allow for the amount of parallelism to be tuned. > Adding a tunable parameter for the number of jobs to run in parallel at a > time would be useful because it would allow the user to adjust the level of > parallelism to be optimal for their task.
[jira] [Updated] (SPARK-21027) Parallel One vs. Rest Classifier
[ https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21027: -- Shepherd: Joseph K. Bradley > Parallel One vs. Rest Classifier > > > Key: SPARK-21027 > URL: https://issues.apache.org/jira/browse/SPARK-21027 > Project: Spark > Issue Type: New Feature > Components: PySpark > Affects Versions: 2.2.0, 2.2.1 > Reporter: Ajay Saini > > Currently, the Scala implementation of OneVsRest allows the user to run a > parallel implementation in which each class is evaluated in a different > thread. This implementation allows up to a 2X speedup as determined by > experiments but is currently not tunable. Furthermore, the python > implementation of OneVsRest does not parallelize at all. It would be useful > to add a parallel, tunable implementation of OneVsRest to the python library > in order to speed up the algorithm. > A ticket for the Scala implementation of this classifier is here: > https://issues.apache.org/jira/browse/SPARK-21028
[jira] [Created] (SPARK-21071) remove append APIs and simplify array writing logic
Wenchen Fan created SPARK-21071: --- Summary: remove append APIs and simplify array writing logic Key: SPARK-21071 URL: https://issues.apache.org/jira/browse/SPARK-21071 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Closed] (SPARK-14450) Python OneVsRest should train multiple models at once
[ https://issues.apache.org/jira/browse/SPARK-14450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-14450. - Resolution: Duplicate > Python OneVsRest should train multiple models at once > - > > Key: SPARK-14450 > URL: https://issues.apache.org/jira/browse/SPARK-14450 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark > Reporter: Joseph K. Bradley > > [SPARK-7861] adds a Python wrapper for OneVsRest. Because of possible issues > related to using existing libraries like {{multiprocessing}}, we are not > training multiple models in parallel initially. > This issue is for prototyping, testing, and implementing a way to train > multiple models at once. Speaking with [~joshrosen], a good option might be > the concurrent.futures package: > * Python 3.x: > [https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures] > * Python 2.x: [https://pypi.python.org/pypi/futures] > We will *not* add this for Spark 2.0, but it will be good to investigate for > 2.1.
[jira] [Commented] (SPARK-14450) Python OneVsRest should train multiple models at once
[ https://issues.apache.org/jira/browse/SPARK-14450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047245#comment-16047245 ] Joseph K. Bradley commented on SPARK-14450: --- See linked JIRA for new issue. > Python OneVsRest should train multiple models at once > - > > Key: SPARK-14450 > URL: https://issues.apache.org/jira/browse/SPARK-14450 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark > Reporter: Joseph K. Bradley > > [SPARK-7861] adds a Python wrapper for OneVsRest. Because of possible issues > related to using existing libraries like {{multiprocessing}}, we are not > training multiple models in parallel initially. > This issue is for prototyping, testing, and implementing a way to train > multiple models at once. Speaking with [~joshrosen], a good option might be > the concurrent.futures package: > * Python 3.x: > [https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures] > * Python 2.x: [https://pypi.python.org/pypi/futures] > We will *not* add this for Spark 2.0, but it will be good to investigate for > 2.1.
[jira] [Commented] (SPARK-14450) Python OneVsRest should train multiple models at once
[ https://issues.apache.org/jira/browse/SPARK-14450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047244#comment-16047244 ] Joseph K. Bradley commented on SPARK-14450: --- Scala already has parallelization. I just rediscovered this issue...so I'll close this old copy. > Python OneVsRest should train multiple models at once > - > > Key: SPARK-14450 > URL: https://issues.apache.org/jira/browse/SPARK-14450 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark > Reporter: Joseph K. Bradley > > [SPARK-7861] adds a Python wrapper for OneVsRest. Because of possible issues > related to using existing libraries like {{multiprocessing}}, we are not > training multiple models in parallel initially. > This issue is for prototyping, testing, and implementing a way to train > multiple models at once. Speaking with [~joshrosen], a good option might be > the concurrent.futures package: > * Python 3.x: > [https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures] > * Python 2.x: [https://pypi.python.org/pypi/futures] > We will *not* add this for Spark 2.0, but it will be good to investigate for > 2.1.
[jira] [Commented] (SPARK-21027) Parallel One vs. Rest Classifier
[ https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047243#comment-16047243 ] Joseph K. Bradley commented on SPARK-21027: --- Copying from [ML-14450]: [SPARK-7861] adds a Python wrapper for OneVsRest. Because of possible issues related to using existing libraries like {{multiprocessing}}, we are not training multiple models in parallel initially. This issue is for prototyping, testing, and implementing a way to train multiple models at once. Speaking with [~joshrosen], a good option might be the concurrent.futures package: * Python 3.x: [https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures] * Python 2.x: [https://pypi.python.org/pypi/futures] > Parallel One vs. Rest Classifier > > > Key: SPARK-21027 > URL: https://issues.apache.org/jira/browse/SPARK-21027 > Project: Spark > Issue Type: New Feature > Components: PySpark > Affects Versions: 2.2.0, 2.2.1 > Reporter: Ajay Saini > > Currently, the Scala implementation of OneVsRest allows the user to run a > parallel implementation in which each class is evaluated in a different > thread. This implementation allows up to a 2X speedup as determined by > experiments but is currently not tunable. Furthermore, the python > implementation of OneVsRest does not parallelize at all. It would be useful > to add a parallel, tunable implementation of OneVsRest to the python library > in order to speed up the algorithm. > A ticket for the Scala implementation of this classifier is here: > https://issues.apache.org/jira/browse/SPARK-21028
[jira] [Comment Edited] (SPARK-21027) Parallel One vs. Rest Classifier
[ https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047243#comment-16047243 ] Joseph K. Bradley edited comment on SPARK-21027 at 6/12/17 11:54 PM: - Copying from [SPARK-14450]: [SPARK-7861] adds a Python wrapper for OneVsRest. Because of possible issues related to using existing libraries like {{multiprocessing}}, we are not training multiple models in parallel initially. This issue is for prototyping, testing, and implementing a way to train multiple models at once. Speaking with [~joshrosen], a good option might be the concurrent.futures package: * Python 3.x: [https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures] * Python 2.x: [https://pypi.python.org/pypi/futures] was (Author: josephkb): Copying from [ML-14450]: [SPARK-7861] adds a Python wrapper for OneVsRest. Because of possible issues related to using existing libraries like {{multiprocessing}}, we are not training multiple models in parallel initially. This issue is for prototyping, testing, and implementing a way to train multiple models at once. Speaking with [~joshrosen], a good option might be the concurrent.futures package: * Python 3.x: [https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures] * Python 2.x: [https://pypi.python.org/pypi/futures] > Parallel One vs. Rest Classifier > > > Key: SPARK-21027 > URL: https://issues.apache.org/jira/browse/SPARK-21027 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Ajay Saini > > Currently, the Scala implementation of OneVsRest allows the user to run a > parallel implementation in which each class is evaluated in a different > thread. This implementation allows up to a 2X speedup as determined by > experiments but is not currently not tunable. Furthermore, the python > implementation of OneVsRest does not parallelize at all. 
It would be useful > to add a parallel, tunable implementation of OneVsRest to the python library > in order to speed up the algorithm. > A ticket for the Scala implementation of this classifier is here: > https://issues.apache.org/jira/browse/SPARK-21028 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
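As a concrete illustration of the concurrent.futures suggestion in the comment above, here is a minimal sketch of training one binary model per class on a thread pool. The helper names (`fit_one_vs_rest`, `train_binary_model`) are illustrative, not the actual PySpark API:

```python
from concurrent.futures import ThreadPoolExecutor


def fit_one_vs_rest(classes, train_binary_model, parallelism=2):
    """Train one binary classifier per class, at most `parallelism` at a time."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # pool.map preserves input order, so model i corresponds to classes[i]
        return list(pool.map(train_binary_model, classes))
```

In a Spark setting, `train_binary_model` would relabel the training DataFrame for one class and call the base estimator's `fit`; threads (rather than `multiprocessing`) are plausible here because the heavy work runs in the JVM and on the cluster, outside the Python GIL.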
[jira] [Commented] (SPARK-21027) Parallel One vs. Rest Classifier
[ https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047241#comment-16047241 ] Joseph K. Bradley commented on SPARK-21027: --- Whoops! I realized I'd reported this long ago...and I'd said there might be issues with python multiprocessing. I don't recall what those issues were. Let's investigate. > Parallel One vs. Rest Classifier > > > Key: SPARK-21027 > URL: https://issues.apache.org/jira/browse/SPARK-21027 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Ajay Saini > > Currently, the Scala implementation of OneVsRest allows the user to run a > parallel implementation in which each class is evaluated in a different > thread. This implementation allows up to a 2X speedup as determined by > experiments but is not currently tunable. Furthermore, the python > implementation of OneVsRest does not parallelize at all. It would be useful > to add a parallel, tunable implementation of OneVsRest to the python library > in order to speed up the algorithm. > A ticket for the Scala implementation of this classifier is here: > https://issues.apache.org/jira/browse/SPARK-21028 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18294) Implement commit protocol to support `mapred` package's committer
[ https://issues.apache.org/jira/browse/SPARK-18294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047233#comment-16047233 ] Dayou Zhou commented on SPARK-18294: Thanks for responding. My colleague Aarati Khobare will provide you with the details. > Implement commit protocol to support `mapred` package's committer > - > > Key: SPARK-18294 > URL: https://issues.apache.org/jira/browse/SPARK-18294 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Jiang Xingbo > > The current `FileCommitProtocol` is based on the `mapreduce` package; we should > implement a `HadoopMapRedCommitProtocol` that supports the older mapred > package's committer. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21047) Add test suites for complicated cases in ColumnarBatchSuite
[ https://issues.apache.org/jira/browse/SPARK-21047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-21047: - Summary: Add test suites for complicated cases in ColumnarBatchSuite (was: Add test suites for complicated cases) > Add test suites for complicated cases in ColumnarBatchSuite > --- > > Key: SPARK-21047 > URL: https://issues.apache.org/jira/browse/SPARK-21047 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > The current {{ColumnarBatchSuite}} has only very simple test cases for arrays. This > JIRA will add test suites for complicated cases, such as nested arrays, in > {{ColumnVector}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18294) Implement commit protocol to support `mapred` package's committer
[ https://issues.apache.org/jira/browse/SPARK-18294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047229#comment-16047229 ] Jiang Xingbo commented on SPARK-18294: -- This is actually legacy code refactoring; it shouldn't affect common use cases because the old code is still valid. Could you expand on why you need this? > Implement commit protocol to support `mapred` package's committer > - > > Key: SPARK-18294 > URL: https://issues.apache.org/jira/browse/SPARK-18294 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Jiang Xingbo > > The current `FileCommitProtocol` is based on the `mapreduce` package; we should > implement a `HadoopMapRedCommitProtocol` that supports the older mapred > package's committer. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18294) Implement commit protocol to support `mapred` package's committer
[ https://issues.apache.org/jira/browse/SPARK-18294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047196#comment-16047196 ] Dayou Zhou commented on SPARK-18294: Hi [~jiangxb1987][~jiangxb], Thank you for making this fix -- we really need it. Just checking with you on the status: when can you merge it into master? > Implement commit protocol to support `mapred` package's committer > - > > Key: SPARK-18294 > URL: https://issues.apache.org/jira/browse/SPARK-18294 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Jiang Xingbo > > The current `FileCommitProtocol` is based on the `mapreduce` package; we should > implement a `HadoopMapRedCommitProtocol` that supports the older mapred > package's committer. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21070) Pick up cloudpickle upgrades from cloudpickle python module
[ https://issues.apache.org/jira/browse/SPARK-21070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21070: Assignee: Apache Spark > Pick up cloudpickle upgrades from cloudpickle python module > --- > > Key: SPARK-21070 > URL: https://issues.apache.org/jira/browse/SPARK-21070 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0, 2.1.1 >Reporter: Kyle Kelley >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21070) Pick up cloudpickle upgrades from cloudpickle python module
[ https://issues.apache.org/jira/browse/SPARK-21070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21070: Assignee: (was: Apache Spark) > Pick up cloudpickle upgrades from cloudpickle python module > --- > > Key: SPARK-21070 > URL: https://issues.apache.org/jira/browse/SPARK-21070 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0, 2.1.1 >Reporter: Kyle Kelley >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21070) Pick up cloudpickle upgrades from cloudpickle python module
[ https://issues.apache.org/jira/browse/SPARK-21070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047153#comment-16047153 ] Apache Spark commented on SPARK-21070: -- User 'rgbkrk' has created a pull request for this issue: https://github.com/apache/spark/pull/18282 > Pick up cloudpickle upgrades from cloudpickle python module > --- > > Key: SPARK-21070 > URL: https://issues.apache.org/jira/browse/SPARK-21070 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0, 2.1.1 >Reporter: Kyle Kelley >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21070) Pick up cloudpickle upgrades from cloudpickle python module
Kyle Kelley created SPARK-21070: --- Summary: Pick up cloudpickle upgrades from cloudpickle python module Key: SPARK-21070 URL: https://issues.apache.org/jira/browse/SPARK-21070 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.1.1, 2.1.0, 2.0.0 Reporter: Kyle Kelley Priority: Minor -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21069) Add rate source to programming guide
[ https://issues.apache.org/jira/browse/SPARK-21069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-21069: - Labels: starter (was: ) > Add rate source to programming guide > > > Key: SPARK-21069 > URL: https://issues.apache.org/jira/browse/SPARK-21069 > Project: Spark > Issue Type: Documentation > Components: Documentation, Structured Streaming >Affects Versions: 2.3.0 >Reporter: Shixiong Zhu > Labels: starter > > SPARK-20979 added a new structured streaming source: rate source. We should > document it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20979) Add a rate source to generate values for tests and benchmark
[ https://issues.apache.org/jira/browse/SPARK-20979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20979: - Affects Version/s: (was: 2.2.0) 2.3.0 > Add a rate source to generate values for tests and benchmark > > > Key: SPARK-20979 > URL: https://issues.apache.org/jira/browse/SPARK-20979 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21027) Parallel One vs. Rest Classifier
[ https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047127#comment-16047127 ] Apache Spark commented on SPARK-21027: -- User 'ajaysaini725' has created a pull request for this issue: https://github.com/apache/spark/pull/18281 > Parallel One vs. Rest Classifier > > > Key: SPARK-21027 > URL: https://issues.apache.org/jira/browse/SPARK-21027 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Ajay Saini > > Currently, the Scala implementation of OneVsRest allows the user to run a > parallel implementation in which each class is evaluated in a different > thread. This implementation allows up to a 2X speedup as determined by > experiments but is not currently tunable. Furthermore, the python > implementation of OneVsRest does not parallelize at all. It would be useful > to add a parallel, tunable implementation of OneVsRest to the python library > in order to speed up the algorithm. > A ticket for the Scala implementation of this classifier is here: > https://issues.apache.org/jira/browse/SPARK-21028 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21027) Parallel One vs. Rest Classifier
[ https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21027: Assignee: (was: Apache Spark) > Parallel One vs. Rest Classifier > > > Key: SPARK-21027 > URL: https://issues.apache.org/jira/browse/SPARK-21027 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Ajay Saini > > Currently, the Scala implementation of OneVsRest allows the user to run a > parallel implementation in which each class is evaluated in a different > thread. This implementation allows up to a 2X speedup as determined by > experiments but is not currently tunable. Furthermore, the python > implementation of OneVsRest does not parallelize at all. It would be useful > to add a parallel, tunable implementation of OneVsRest to the python library > in order to speed up the algorithm. > A ticket for the Scala implementation of this classifier is here: > https://issues.apache.org/jira/browse/SPARK-21028 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21027) Parallel One vs. Rest Classifier
[ https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21027: Assignee: Apache Spark > Parallel One vs. Rest Classifier > > > Key: SPARK-21027 > URL: https://issues.apache.org/jira/browse/SPARK-21027 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Ajay Saini >Assignee: Apache Spark > > Currently, the Scala implementation of OneVsRest allows the user to run a > parallel implementation in which each class is evaluated in a different > thread. This implementation allows up to a 2X speedup as determined by > experiments but is not currently tunable. Furthermore, the python > implementation of OneVsRest does not parallelize at all. It would be useful > to add a parallel, tunable implementation of OneVsRest to the python library > in order to speed up the algorithm. > A ticket for the Scala implementation of this classifier is here: > https://issues.apache.org/jira/browse/SPARK-21028 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21069) Add rate source to programming guide
Shixiong Zhu created SPARK-21069: Summary: Add rate source to programming guide Key: SPARK-21069 URL: https://issues.apache.org/jira/browse/SPARK-21069 Project: Spark Issue Type: Documentation Components: Documentation, Structured Streaming Affects Versions: 2.3.0 Reporter: Shixiong Zhu SPARK-20979 added a new structured streaming source: rate source. We should document it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20979) Add a rate source to generate values for tests and benchmark
[ https://issues.apache.org/jira/browse/SPARK-20979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-20979. -- Resolution: Fixed Fix Version/s: 2.3.0 > Add a rate source to generate values for tests and benchmark > > > Key: SPARK-20979 > URL: https://issues.apache.org/jira/browse/SPARK-20979 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-20379) Allow setting SSL-related passwords through env variables
[ https://issues.apache.org/jira/browse/SPARK-20379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reopened SPARK-20379: I'll re-open this since the SSL options don't really use the code path that can reference env variables in config values, so this is still something we should add. Fix will be different than my previous PR though. > Allow setting SSL-related passwords through env variables > - > > Key: SPARK-20379 > URL: https://issues.apache.org/jira/browse/SPARK-20379 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Currently, Spark reads all SSL options from configuration, which can be > provided in a file or through the command line. This means that to set the > SSL keystore / trust store / key passwords, you have to use one of those > options. > Using the command line for that is not secure, and in some environments > admins prefer to not have the password written in plain text in a file (since > the file and the data it's protecting could be stored in the same > filesystem). So for these cases it would be nice to be able to provide these > passwords through environment variables, which are not written to disk and > also not readable by other users on the same machine. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
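The idea above -- let SSL passwords come from environment variables instead of config files or the command line -- can be sketched as follows. The function and variable names here are hypothetical, not Spark's actual API:

```python
import os


def resolve_ssl_password(conf, conf_key, env_var):
    """Prefer an environment variable; fall back to the configured value.

    Environment variables are not written to disk and are not readable by
    other users on the same machine, unlike command-line arguments or
    plain-text config files.
    """
    value = os.environ.get(env_var)
    return value if value is not None else conf.get(conf_key)
```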
[jira] [Created] (SPARK-21068) SparkR error message when passed an R object rather than Java object could be more informative
holdenk created SPARK-21068: --- Summary: SparkR error message when passed an R object rather than Java object could be more informative Key: SPARK-21068 URL: https://issues.apache.org/jira/browse/SPARK-21068 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 2.2.0, 2.3.0 Reporter: holdenk Priority: Trivial When SparkR is passed a non-Java object where a Java object is expected, its backend code produces the error message `Error: class(objId) == "jobj" is not TRUE`. See backend.R and the isInstanceOf function -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21050) ml word2vec write has overflow issue in calculating numPartitions
[ https://issues.apache.org/jira/browse/SPARK-21050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-21050. --- Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 Issue resolved by pull request 18265 [https://github.com/apache/spark/pull/18265] > ml word2vec write has overflow issue in calculating numPartitions > - > > Key: SPARK-21050 > URL: https://issues.apache.org/jira/browse/SPARK-21050 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Fix For: 2.2.1, 2.3.0 > > > The method calculateNumberOfPartitions() uses Int, not Long (unlike the MLlib > version), so it is very easily to have an overflow in calculating the number > of partitions for ML persistence. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
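To see why the Int-based calculation in SPARK-21050 overflows, here is a sketch that simulates Scala/Java 32-bit Int arithmetic in Python; the sizes are illustrative, not the exact Word2Vec formula:

```python
INT_MAX = 2**31 - 1


def to_int32(x):
    """Wrap an integer the way Scala/Java Int arithmetic would."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x


# A large vocabulary times a typical vector size times 4 bytes per float:
num_words, vector_size, bytes_per_float = 10_000_000, 300, 4
exact = num_words * vector_size * bytes_per_float            # 12 GB, needs a Long
wrapped = to_int32(to_int32(num_words * vector_size) * bytes_per_float)
# `wrapped` is negative, so any partition count derived from it is nonsense --
# which is why Long arithmetic (as in the MLlib version) is needed here.
```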
[jira] [Updated] (SPARK-20499) Spark MLlib, GraphX 2.2 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-20499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20499: -- Fix Version/s: 2.2.0 > Spark MLlib, GraphX 2.2 QA umbrella > --- > > Key: SPARK-20499 > URL: https://issues.apache.org/jira/browse/SPARK-20499 > Project: Spark > Issue Type: Umbrella > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > Fix For: 2.2.0 > > > This JIRA lists tasks for the next Spark release's QA period for MLlib and > GraphX. *SparkR is separate: [SPARK-20508].* > The list below gives an overview of what is involved, and the corresponding > JIRA issues are linked below that. > h2. API > * Check binary API compatibility for Scala/Java > * Audit new public APIs (from the generated html doc) > ** Scala > ** Java compatibility > ** Python coverage > * Check Experimental, DeveloperApi tags > h2. Algorithms and performance > * Performance tests > h2. Documentation and example code > * For new algorithms, create JIRAs for updating the user guide sections & > examples > * Update Programming Guide > * Update website -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20507) Update MLlib, GraphX websites for 2.2
[ https://issues.apache.org/jira/browse/SPARK-20507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20507: -- Fix Version/s: 2.2.0 > Update MLlib, GraphX websites for 2.2 > - > > Key: SPARK-20507 > URL: https://issues.apache.org/jira/browse/SPARK-20507 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath >Priority: Critical > Fix For: 2.2.0 > > > Update the sub-projects' websites to include new features in this release. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21059) LikeSimplification can NPE on null pattern
[ https://issues.apache.org/jira/browse/SPARK-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21059. - Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 > LikeSimplification can NPE on null pattern > -- > > Key: SPARK-21059 > URL: https://issues.apache.org/jira/browse/SPARK-21059 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.2.1, 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20345) Fix STS error handling logic on HiveSQLException
[ https://issues.apache.org/jira/browse/SPARK-20345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20345. - Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.3.0 2.2.1 > Fix STS error handling logic on HiveSQLException > > > Key: SPARK-20345 > URL: https://issues.apache.org/jira/browse/SPARK-20345 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.2.1, 2.3.0 > > > [SPARK-5100|https://github.com/apache/spark/commit/343d3bfafd449a0371feb6a88f78e07302fa7143] > added Spark Thrift Server UI and the following logic to handle exceptions on > case `Throwable`. > {code} > HiveThriftServer2.listener.onStatementError( > statementId, e.getMessage, SparkUtils.exceptionString(e)) > {code} > However, there occurred a missed case after implementing > [SPARK-6964|https://github.com/apache/spark/commit/eb19d3f75cbd002f7e72ce02017a8de67f562792]'s > `Support Cancellation in the Thrift Server` by adding case > `HiveSQLException` before case `Throwable`. > Logically, we had better add `HiveThriftServer2.listener.onStatementError` on > case `HiveSQLException`, too. > {code} > case e: HiveSQLException => > if (getStatus().getState() == OperationState.CANCELED) { > return > } else { > setState(OperationState.ERROR) > throw e > } > // Actually do need to catch Throwable as some failures don't inherit > from Exception and > // HiveServer will silently swallow them. 
> case e: Throwable => > val currentState = getStatus().getState() > logError(s"Error executing query, currentState $currentState, ", e) > setState(OperationState.ERROR) > HiveThriftServer2.listener.onStatementError( > statementId, e.getMessage, SparkUtils.exceptionString(e)) > throw new HiveSQLException(e.toString) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
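The change proposed in the description above -- report the error to the listener in the `HiveSQLException` branch too, not only in the catch-all `Throwable` branch -- can be modeled in Python. The names here (`HiveSQLError`, `execute`, `on_statement_error`) are hypothetical stand-ins for the Scala code quoted above:

```python
class HiveSQLError(Exception):
    """Stand-in for HiveSQLException."""


def execute(work, listener, statement_id, get_state):
    try:
        work()
    except HiveSQLError as e:      # analog of `case e: HiveSQLException`
        if get_state() == "CANCELED":
            return                 # cancellation is not an error
        # The missing piece: notify the listener in this branch as well.
        listener.on_statement_error(statement_id, str(e))
        raise
    except Exception as e:         # analog of `case e: Throwable`
        listener.on_statement_error(statement_id, str(e))
        raise HiveSQLError(str(e))
```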
[jira] [Created] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
Dominic Ricard created SPARK-21067: -- Summary: Thrift Server - CTAS fail with Unable to move source Key: SPARK-21067 URL: https://issues.apache.org/jira/browse/SPARK-21067 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1 Environment: Yarn Hive MetaStore HDFS (HA) Reporter: Dominic Ricard After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS would fail, sometimes... Most of the time, the CTAS would work only once after starting the thrift server. After that, dropping the table and re-issuing the same CTAS would fail with the following message (sometimes it fails right away, sometimes it works for a long period of time): {noformat} Error: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0 to destination hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; (state=,code=0) {noformat} We have already found the following Jira (https://issues.apache.org/jira/browse/SPARK-11021) which states that the {{hive.exec.stagingdir}} had to be added in order for Spark to be able to handle CREATE TABLE properly as of 2.0. As you can see in the error, we have ours set to "/tmp/hive-staging/\{user.name\}" Same issue with INSERT statements: {noformat} CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE dricard.test SELECT 1; Error: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0 to destination hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; (state=,code=0) {noformat} This worked fine in 1.6.2, which we currently run in our Production Environment, but since 2.0+ we haven't been able to CREATE TABLE consistently on the cluster. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17642) Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
[ https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047045#comment-16047045 ] Zhenhua Wang commented on SPARK-17642: -- [~mbasmanova] I've reopened and rebased the above PR. I also think security should not be a blocking issue for the PR, since we currently don't have a security mechanism inside Spark. Let's see others' comments. > Support DESC FORMATTED TABLE COLUMN command to show column-level statistics > --- > > Key: SPARK-17642 > URL: https://issues.apache.org/jira/browse/SPARK-17642 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang > > Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command. > Support DESC FORMATTED TABLE COLUMN command to show column-level statistics. > We should resolve this jira after column-level statistics are supported. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17914) Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp
[ https://issues.apache.org/jira/browse/SPARK-17914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-17914. --- Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 Issue resolved by pull request 18252 [https://github.com/apache/spark/pull/18252] > Spark SQL casting to TimestampType with nanosecond results in incorrect > timestamp > - > > Key: SPARK-17914 > URL: https://issues.apache.org/jira/browse/SPARK-17914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Oksana Romankova >Assignee: Anton Okolnychyi > Fix For: 2.2.1, 2.3.0 > > > In some cases when timestamps contain nanoseconds they will be parsed > incorrectly. > Examples: > "2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567" > "2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678" > The issue seems to be happening in DateTimeUtils.stringToTimestamp(). It > assumes that only a 6-digit fraction of a second will be passed. > With this being the case I would suggest either discarding nanoseconds > automatically or throwing an exception prompting users to pre-format timestamps to > microsecond precision before casting to the Timestamp. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
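The suggested fix -- discard digits beyond microsecond precision instead of misreading them -- can be sketched as a standalone function (an illustration only, not Spark's actual `DateTimeUtils.stringToTimestamp` code):

```python
import re
from datetime import datetime


def parse_timestamp(s):
    """Parse an ISO-like timestamp, truncating fractional seconds to 6 digits."""
    m = re.match(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z?$", s)
    if m is None:
        raise ValueError("unparseable timestamp: " + s)
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    frac = (m.group(2) or "")[:6].ljust(6, "0")  # drop nanosecond digits
    return base.replace(microsecond=int(frac))
```

With this truncation, "2016-05-14T15:12:14.0034567Z" parses as 15:12:14.003456, rather than the incorrect .034567 reported above.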
[jira] [Assigned] (SPARK-17914) Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp
[ https://issues.apache.org/jira/browse/SPARK-17914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-17914: - Assignee: Anton Okolnychyi > Spark SQL casting to TimestampType with nanosecond results in incorrect > timestamp > - > > Key: SPARK-17914 > URL: https://issues.apache.org/jira/browse/SPARK-17914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Oksana Romankova >Assignee: Anton Okolnychyi > > In some cases when timestamps contain nanoseconds they will be parsed > incorrectly. > Examples: > "2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567" > "2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678" > The issue seems to be happening in DateTimeUtils.stringToTimestamp(). It > assumes that only a 6-digit fraction of a second will be passed. > With this being the case I would suggest either discarding nanoseconds > automatically or throwing an exception prompting users to pre-format timestamps to > microsecond precision before casting to the Timestamp. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21065) Spark Streaming concurrentJobs + StreamingJobProgressListener conflict
[ https://issues.apache.org/jira/browse/SPARK-21065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dan Dutrow updated SPARK-21065: --- Description: My streaming application has 200+ output operations, many of them stateful and several of them windowed. In an attempt to reduce the processing times, I set "spark.streaming.concurrentJobs" to 2+. Initial results are very positive, cutting our processing time from ~3 minutes to ~1 minute, but eventually we encounter an exception as follows: (Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a batch from 45 minutes before the exception is thrown.) 2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener StreamingJobProgressListener threw an exception java.util.NoSuchElementException: key not found 149697756 ms at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:59) at scala.collection.mutable.HashMap.apply(HashMap.scala:65) at org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128) at org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67) at org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) at org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29) at org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43) ... 
The Spark code causing the exception is here: https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC125 override def onOutputOperationCompleted( outputOperationCompleted: StreamingListenerOutputOperationCompleted): Unit = synchronized { // This method is called before onBatchCompleted {color:red}runningBatchUIData(outputOperationCompleted.outputOperationInfo.batchTime).{color} updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo) } It seems to me that it may be caused by that batch being removed earlier. https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC102 override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = { synchronized { waitingBatchUIData.remove(batchCompleted.batchInfo.batchTime) {color:red}runningBatchUIData.remove(batchCompleted.batchInfo.batchTime){color} val batchUIData = BatchUIData(batchCompleted.batchInfo) completedBatchUIData.enqueue(batchUIData) if (completedBatchUIData.size > batchUIDataLimit) { val removedBatch = completedBatchUIData.dequeue() batchTimeToOutputOpIdSparkJobIdPair.remove(removedBatch.batchTime) } totalCompletedBatches += 1L totalProcessedRecords += batchUIData.numRecords } } What is the solution here? Should I make my spark streaming context remember duration a lot longer? ssc.remember(batchDuration * rememberMultiple) Otherwise, it seems like there should be some kind of existence check on runningBatchUIData before dereferencing it. was: My streaming application has 200+ output operations, many of them stateful and several of them windowed. In an attempt to reduce the processing times, I set "spark.streaming.concurrentJobs" to 2+. 
Initial results are very positive, cutting our processing time from ~3 minutes to ~1 minute, but eventually we encounter an exception as follows: Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a batch from 45 minutes before the exception is thrown. 2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener StreamingJobProgressListener threw an exception java.util.NoSuchElementException: key not found 149697756 ms at scala.collection.MalLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:59) at scala.collection.mutable.HashMap.apply(HashMap.scala:65) at org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128) at org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67) at org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) at org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29) at org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43) ... The Spark code
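The existence check suggested in the description above can be sketched as follows. This is a Python stand-in for the Scala listener, and the class and field names here are illustrative, not Spark's: the point is to look the batch up and drop late events instead of dereferencing unconditionally.

```python
class ProgressListener:
    # Minimal stand-in for StreamingJobProgressListener: tracks running
    # batches by batch time and tolerates a completion event that arrives
    # after the batch was already evicted by onBatchCompleted.
    def __init__(self):
        self.running_batches = {}  # batch_time -> list of op infos
        self.dropped_events = 0

    def on_output_operation_completed(self, batch_time, op_info):
        batch = self.running_batches.get(batch_time)
        if batch is None:
            # Batch already completed and removed; count/log and drop
            # instead of raising NoSuchElementException as the current
            # unconditional lookup does.
            self.dropped_events += 1
            return
        batch.append(op_info)
```

The same pattern in the real code would be a `runningBatchUIData.get(batchTime)` returning an `Option`, matched before calling `updateOutputOperationInfo`.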
[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046957#comment-16046957 ] Sital Kedia commented on SPARK-18838: - [~joshrosen] - The PR for my change to multi-thread the event processor has been closed inadvertently as a stale PR. Can you take a look at the approach and let me know what you think? Please note that this change is very critical for running large jobs in our environment. https://github.com/apache/spark/pull/16291 > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia > Attachments: perfResults.pdf, SparkListernerComputeTime.xlsx > > > Currently we are observing the issue of very high event processing delay in > the driver's `ListenerBus` for large jobs with many tasks. Many critical > components of the scheduler, like `ExecutorAllocationManager` and > `HeartbeatReceiver`, depend on the `ListenerBus` events, and this delay might > hurt job performance significantly or even fail the job. For example, a > significant delay in receiving the `SparkListenerTaskStart` might cause > `ExecutorAllocationManager` to mistakenly remove an executor which is > not idle. > The problem is that the event processor in `ListenerBus` is a single thread > which loops through all the listeners for each event and processes each event > synchronously: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > This single-threaded processor often becomes the bottleneck for large jobs. > Also, if one of the listeners is very slow, all the listeners will pay the > price of the delay incurred by the slow listener. In addition, a slow > listener can cause events to be dropped from the event queue, which might be > fatal to the job. 
> To solve the above problems, we propose to get rid of the event queue and the > single-threaded event processor. Instead, each listener will have its own > dedicated single-threaded executor service. Whenever an event is posted, it > will be submitted to the executor service of every listener. The single-threaded > executor service will guarantee in-order processing of the events > per listener. The queue used for the executor service will be bounded to > guarantee we do not grow the memory indefinitely. The downside of this > approach is that a separate event queue per listener will increase the driver memory > footprint. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
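The per-listener design proposed above can be sketched as follows. This is a Python sketch with illustrative names, not Spark's implementation, and it omits the bounded queue the proposal calls for: one single-threaded executor per listener gives in-order delivery per listener while isolating slow listeners from each other.

```python
from concurrent.futures import ThreadPoolExecutor

class PerListenerBus:
    # Each listener gets its own single-worker executor, so events are
    # processed in posting order per listener, and a slow listener delays
    # only its own queue rather than every listener on the bus.
    def __init__(self, listeners):
        self._pairs = [(l, ThreadPoolExecutor(max_workers=1))
                       for l in listeners]

    def post(self, event):
        # Fan the event out to every listener's private executor.
        for listener, executor in self._pairs:
            executor.submit(listener, event)

    def shutdown(self):
        for _, executor in self._pairs:
            executor.shutdown(wait=True)
```

A production version would cap each executor's work queue (the memory bound discussed above) and decide on a drop or block policy when a queue fills.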
[jira] [Updated] (SPARK-20434) Move Hadoop delegation token code from yarn to core
[ https://issues.apache.org/jira/browse/SPARK-20434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-20434: Summary: Move Hadoop delegation token code from yarn to core (was: Move Kerberos delegation token code from yarn to core) > Move Hadoop delegation token code from yarn to core > --- > > Key: SPARK-20434 > URL: https://issues.apache.org/jira/browse/SPARK-20434 > Project: Spark > Issue Type: Task > Components: Mesos, Spark Core, YARN >Affects Versions: 2.1.0 >Reporter: Michael Gummelt > > This is to enable kerberos support for other schedulers, such as Mesos. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20511) SparkR 2.2 QA: Check for new R APIs requiring example code
[ https://issues.apache.org/jira/browse/SPARK-20511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-20511: - Assignee: Felix Cheung > SparkR 2.2 QA: Check for new R APIs requiring example code > -- > > Key: SPARK-20511 > URL: https://issues.apache.org/jira/browse/SPARK-20511 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Felix Cheung > Fix For: 2.2.0 > > > Audit list of new features added to MLlib's R API, and see which major items > are missing example code (in the examples folder). We do not need examples > for everything, only for major items such as new algorithms. > For any such items: > * Create a JIRA for that feature, and assign it to the author of the feature > (or yourself if interested). > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires"). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18864) Changes of MLlib and SparkR behavior for 2.2
[ https://issues.apache.org/jira/browse/SPARK-18864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18864: -- Fix Version/s: 2.2.0 > Changes of MLlib and SparkR behavior for 2.2 > > > Key: SPARK-18864 > URL: https://issues.apache.org/jira/browse/SPARK-18864 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML, MLlib, SparkR >Reporter: Joseph K. Bradley > Fix For: 2.2.0 > > > This JIRA is for tracking changes of behavior within MLlib and SparkR for the > Spark 2.2 release. If any JIRAs change behavior, please list them below with > a short description of the change. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20511) SparkR 2.2 QA: Check for new R APIs requiring example code
[ https://issues.apache.org/jira/browse/SPARK-20511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20511: -- Fix Version/s: 2.2.0 > SparkR 2.2 QA: Check for new R APIs requiring example code > -- > > Key: SPARK-20511 > URL: https://issues.apache.org/jira/browse/SPARK-20511 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley > Fix For: 2.2.0 > > > Audit list of new features added to MLlib's R API, and see which major items > are missing example code (in the examples folder). We do not need examples > for everything, only for major items such as new algorithms. > For any such items: > * Create a JIRA for that feature, and assign it to the author of the feature > (or yourself if interested). > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires"). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20508) Spark R 2.2 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-20508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-20508: - Assignee: Felix Cheung (was: Joseph K. Bradley) > Spark R 2.2 QA umbrella > --- > > Key: SPARK-20508 > URL: https://issues.apache.org/jira/browse/SPARK-20508 > Project: Spark > Issue Type: Umbrella > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Felix Cheung >Priority: Critical > Fix For: 2.2.0 > > > This JIRA lists tasks for the next Spark release's QA period for SparkR. > The list below gives an overview of what is involved, and the corresponding > JIRA issues are linked below that. > h2. API > * Audit new public APIs (from the generated html doc) > ** relative to Spark Scala/Java APIs > ** relative to popular R libraries > h2. Documentation and example code > * For new algorithms, create JIRAs for updating the user guide sections & > examples > * Update Programming Guide > * Update website -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Deleted] (SPARK-20513) Update SparkR website for 2.2
[ https://issues.apache.org/jira/browse/SPARK-20513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley deleted SPARK-20513: -- > Update SparkR website for 2.2 > - > > Key: SPARK-20513 > URL: https://issues.apache.org/jira/browse/SPARK-20513 > Project: Spark > Issue Type: Sub-task >Reporter: Joseph K. Bradley >Priority: Critical > > Update the sub-project's website to include new features in this release. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20508) Spark R 2.2 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-20508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20508: -- Fix Version/s: 2.2.0 > Spark R 2.2 QA umbrella > --- > > Key: SPARK-20508 > URL: https://issues.apache.org/jira/browse/SPARK-20508 > Project: Spark > Issue Type: Umbrella > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Felix Cheung >Priority: Critical > Fix For: 2.2.0 > > > This JIRA lists tasks for the next Spark release's QA period for SparkR. > The list below gives an overview of what is involved, and the corresponding > JIRA issues are linked below that. > h2. API > * Audit new public APIs (from the generated html doc) > ** relative to Spark Scala/Java APIs > ** relative to popular R libraries > h2. Documentation and example code > * For new algorithms, create JIRAs for updating the user guide sections & > examples > * Update Programming Guide > * Update website -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20510) SparkR 2.2 QA: Update user guide for new features & APIs
[ https://issues.apache.org/jira/browse/SPARK-20510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20510: -- Fix Version/s: 2.2.0 > SparkR 2.2 QA: Update user guide for new features & APIs > > > Key: SPARK-20510 > URL: https://issues.apache.org/jira/browse/SPARK-20510 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Felix Cheung >Priority: Critical > Fix For: 2.2.0 > > > Check the user guide vs. a list of new APIs (classes, methods, data members) > to see what items require updates to the user guide. > For each feature missing user guide doc: > * Create a JIRA for that feature, and assign it to the author of the feature > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires"). > If you would like to work on this task, please comment, and we can create & > link JIRAs for parts of this work. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20510) SparkR 2.2 QA: Update user guide for new features & APIs
[ https://issues.apache.org/jira/browse/SPARK-20510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-20510: - Assignee: Felix Cheung > SparkR 2.2 QA: Update user guide for new features & APIs > > > Key: SPARK-20510 > URL: https://issues.apache.org/jira/browse/SPARK-20510 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Felix Cheung >Priority: Critical > Fix For: 2.2.0 > > > Check the user guide vs. a list of new APIs (classes, methods, data members) > to see what items require updates to the user guide. > For each feature missing user guide doc: > * Create a JIRA for that feature, and assign it to the author of the feature > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires"). > If you would like to work on this task, please comment, and we can create & > link JIRAs for parts of this work. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20513) Update SparkR website for 2.2
[ https://issues.apache.org/jira/browse/SPARK-20513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046843#comment-16046843 ] Joseph K. Bradley commented on SPARK-20513: --- whoops, i'll delete this... > Update SparkR website for 2.2 > - > > Key: SPARK-20513 > URL: https://issues.apache.org/jira/browse/SPARK-20513 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Update the sub-project's website to include new features in this release. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20512) SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
[ https://issues.apache.org/jira/browse/SPARK-20512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20512: -- Fix Version/s: 2.2.0 > SparkR 2.2 QA: Programming guide, migration guide, vignettes updates > > > Key: SPARK-20512 > URL: https://issues.apache.org/jira/browse/SPARK-20512 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > Fix For: 2.2.0 > > > Before the release, we need to update the SparkR Programming Guide, its > migration guide, and the R vignettes. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-18864]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > * Update R vignettes > Note: This task is for large changes to the guides. New features are handled > in [SPARK-20505]. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20512) SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
[ https://issues.apache.org/jira/browse/SPARK-20512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-20512: - Assignee: Felix Cheung > SparkR 2.2 QA: Programming guide, migration guide, vignettes updates > > > Key: SPARK-20512 > URL: https://issues.apache.org/jira/browse/SPARK-20512 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Felix Cheung >Priority: Critical > Fix For: 2.2.0 > > > Before the release, we need to update the SparkR Programming Guide, its > migration guide, and the R vignettes. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-18864]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > * Update R vignettes > Note: This task is for large changes to the guides. New features are handled > in [SPARK-20505]. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21066) LibSVM load just one input file
[ https://issues.apache.org/jira/browse/SPARK-21066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046812#comment-16046812 ] Sean Owen commented on SPARK-21066: --- CC [~lian cheng] I don't immediately see why the relation can't examine multiple files to compute the number of features, either. > LibSVM load just one input file > --- > > Key: SPARK-21066 > URL: https://issues.apache.org/jira/browse/SPARK-21066 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.1 >Reporter: darion yaphet > > Currently, when we use LibSVM to train on a dataset, only one input file is > accepted. > A file stored on a distributed file system such as HDFS is split into > multiple pieces, and I think this limit is not necessary. > We can join the input paths into a single comma-separated string. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17642) Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
[ https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046798#comment-16046798 ] Dongjoon Hyun commented on SPARK-17642: --- Sorry, I cannot help you further with that because I'm not a Databricks guy. > Support DESC FORMATTED TABLE COLUMN command to show column-level statistics > --- > > Key: SPARK-17642 > URL: https://issues.apache.org/jira/browse/SPARK-17642 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang > > Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command. > Support DESC FORMATTED TABLE COLUMN command to show column-level statistics. > We should resolve this jira after column-level statistics are supported. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17642) Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
[ https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046794#comment-16046794 ] Maria commented on SPARK-17642: --- [~dongjoon], where can I read more on this? I found a blog post [1] announcing built-in access control, but no further specifics. [1] https://databricks.com/blog/2017/05/24/databricks-runtime-3-0-beta-delivers-enterprise-grade-apache-spark.html > Support DESC FORMATTED TABLE COLUMN command to show column-level statistics > --- > > Key: SPARK-17642 > URL: https://issues.apache.org/jira/browse/SPARK-17642 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang > > Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command. > Support DESC FORMATTED TABLE COLUMN command to show column-level statistics. > We should resolve this jira after column-level statistics are supported. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20927) Add cache operator to Unsupported Operations in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046740#comment-16046740 ] Shixiong Zhu commented on SPARK-20927: -- Do nothing except logging a warning > Add cache operator to Unsupported Operations in Structured Streaming > - > > Key: SPARK-20927 > URL: https://issues.apache.org/jira/browse/SPARK-20927 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > Labels: starter > > Just [found > out|https://stackoverflow.com/questions/42062092/why-does-using-cache-on-streaming-datasets-fail-with-analysisexception-queries] > that {{cache}} is not allowed on streaming datasets. > {{cache}} on streaming datasets leads to the following exception: > {code} > scala> spark.readStream.text("files").cache > org.apache.spark.sql.AnalysisException: Queries with streaming sources must > be executed with writeStream.start();; > FileSource[files] > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34) > at > org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63) > at > org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74) > at > org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72) > at > 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78) > at > org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89) > at > org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:104) > at > org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:68) > at > org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:92) > at org.apache.spark.sql.Dataset.persist(Dataset.scala:2603) > at org.apache.spark.sql.Dataset.cache(Dataset.scala:2613) > ... 48 elided > {code} > It should be included in Structured Streaming's [Unsupported > Operations|http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations]. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17642) Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
[ https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046739#comment-16046739 ] Dongjoon Hyun commented on SPARK-17642: --- I see. Interesting. At first glance, the comments seem to be related to Databricks 3.0 (beta) instead of Apache Spark 2.2. Some features in `com.databricks.sql.acl.HiveAclExtensions` and `com.databricks.spark.sql.acl.client.SparkSqlAclClient`? > Support DESC FORMATTED TABLE COLUMN command to show column-level statistics > --- > > Key: SPARK-17642 > URL: https://issues.apache.org/jira/browse/SPARK-17642 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang > > Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command. > Support DESC FORMATTED TABLE COLUMN command to show column-level statistics. > We should resolve this jira after column-level statistics are supported. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21046) simplify the array offset and length in ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-21046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21046. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18260 [https://github.com/apache/spark/pull/18260] > simplify the array offset and length in ColumnVector > > > Key: SPARK-21046 > URL: https://issues.apache.org/jira/browse/SPARK-21046 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17642) Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
[ https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046718#comment-16046718 ] Maria commented on SPARK-17642: --- [~dongjoon], thanks for such a quick response. Attached PR [1] introduces 'DESC EXTENDED table-name column-name' SQL command to show column-level statistics: min/max, number of distinct values, histograms, etc. [~smilegator] raised a security concern: a user who doesn't have access to column C may still learn about the data in that column. "After rethinking about it, DESC EXTENDED/FORMATTED COLUMN discloses the data patterns/statistics info. These info are pretty sensitive. Not all the users should be allowed to access it. We might face the security-related complaints about this feature. ... Column-level security can block users to access the specific columns, but this command DESC EXTENDED/FORMATTED COLUMN might not be part of the design/solution." [1] https://github.com/apache/spark/pull/16422 > Support DESC FORMATTED TABLE COLUMN command to show column-level statistics > --- > > Key: SPARK-17642 > URL: https://issues.apache.org/jira/browse/SPARK-17642 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang > > Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command. > Support DESC FORMATTED TABLE COLUMN command to show column-level statistics. > We should resolve this jira after column-level statistics are supported. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21066) LibSVM load just one input file
[ https://issues.apache.org/jira/browse/SPARK-21066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] darion yaphet updated SPARK-21066: -- Description: Currently, when we use LibSVM to train on a dataset, only one input file is accepted. A file stored on a distributed file system such as HDFS is split into multiple pieces, and I think this limit is not necessary. We can join the input paths into a single comma-separated string. was: Currently, when we use LibSVM to train on a dataset, only one input file is accepted. The source code is as follows:
{code}
val path = if (dataFiles.length == 1) {
  dataFiles.head.getPath.toUri.toString
} else if (dataFiles.isEmpty) {
  throw new IOException("No input path specified for libsvm data")
} else {
  throw new IOException("Multiple input paths are not supported for libsvm data.")
}
{code}
A file stored on a distributed file system such as HDFS is split into multiple pieces, and I think this limit is not necessary. We can join the input paths into a single comma-separated string. > LibSVM load just one input file > --- > > Key: SPARK-21066 > URL: https://issues.apache.org/jira/browse/SPARK-21066 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.1 >Reporter: darion yaphet > > Currently, when we use LibSVM to train on a dataset, only one input file is > accepted. > A file stored on a distributed file system such as HDFS is split into > multiple pieces, and I think this limit is not necessary. > We can join the input paths into a single comma-separated string. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21066) LibSVM load just one input file
darion yaphet created SPARK-21066: - Summary: LibSVM load just one input file Key: SPARK-21066 URL: https://issues.apache.org/jira/browse/SPARK-21066 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.1.1 Reporter: darion yaphet Currently, when we use LibSVM to train on a dataset, only one input file is accepted. The source code is as follows:
{code}
val path = if (dataFiles.length == 1) {
  dataFiles.head.getPath.toUri.toString
} else if (dataFiles.isEmpty) {
  throw new IOException("No input path specified for libsvm data")
} else {
  throw new IOException("Multiple input paths are not supported for libsvm data.")
}
{code}
A file stored on a distributed file system such as HDFS is split into multiple pieces, and I think this limit is not necessary. We can join the input paths into a single comma-separated string. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
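The change the reporter suggests amounts to replacing the single-path check with a comma-joined path string, the form Hadoop-style input APIs accept. A sketch with a hypothetical helper name (Python for illustration, mirroring the Scala branch structure quoted above):

```python
def resolve_libsvm_paths(data_files):
    # Instead of rejecting multiple inputs, accept any non-empty list of
    # paths and join them with commas; only the empty case stays an error.
    if not data_files:
        raise IOError("No input path specified for libsvm data")
    return ",".join(data_files)
```

In the Scala source this would collapse the three-way `if` into an emptiness check plus `dataFiles.map(_.getPath.toUri.toString).mkString(",")`.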
[jira] [Commented] (SPARK-17642) Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
[ https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046694#comment-16046694 ] Dongjoon Hyun commented on SPARK-17642: --- Hi, [~mbasmanova]. Spark-LLAP is a third party library. Is it related to this CBO issue in Apache Spark? > Support DESC FORMATTED TABLE COLUMN command to show column-level statistics > --- > > Key: SPARK-17642 > URL: https://issues.apache.org/jira/browse/SPARK-17642 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang > > Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command. > Support DESC FORMATTED TABLE COLUMN command to show column-level statistics. > We should resolve this jira after column-level statistics are supported. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17642) Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
[ https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046664#comment-16046664 ] Maria commented on SPARK-17642: --- Folks, it seems to me that column-level access control is implemented outside of Spark, in the spark-llap library. It also looks like making Spark SQL commands *secure* is the responsibility of the maintainers of LLAP. I'm making this assumption based on the following PRs, which hardened the existing DESC table command: https://github.com/hortonworks-spark/spark-llap/pull/18 https://github.com/hortonworks-spark/spark-llap/pull/134 [~dongjoon], could you confirm? If that's the case, then wouldn't a heads-up to [~dongjoon] be sufficient to unblock the attached PR? > Support DESC FORMATTED TABLE COLUMN command to show column-level statistics > --- > > Key: SPARK-17642 > URL: https://issues.apache.org/jira/browse/SPARK-17642 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang > > Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command. > Support DESC FORMATTED TABLE COLUMN command to show column-level statistics. > We should resolve this jira after column-level statistics are supported. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20715) MapStatuses shouldn't be redundantly stored in both ShuffleMapStage and MapOutputTracker
[ https://issues.apache.org/jira/browse/SPARK-20715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-20715. Resolution: Fixed Fix Version/s: 2.3.0 Fixed for 2.3.0. > MapStatuses shouldn't be redundantly stored in both ShuffleMapStage and > MapOutputTracker > > > Key: SPARK-20715 > URL: https://issues.apache.org/jira/browse/SPARK-20715 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Shuffle >Affects Versions: 2.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.3.0 > > > Today the MapOutputTracker and ShuffleMapStage both maintain their own copies > of MapStatuses. This creates the potential for bugs in case these two pieces > of state become out of sync. > I believe that we can improve our ability to reason about the code by storing > this information only in the MapOutputTracker. This can also help to reduce > driver memory consumption. > I will provide more details in my PR, where I'll walk through the detailed > arguments as to why we can take these two different metadata tracking formats > and consolidate without loss of performance or correctness. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21065) Spark Streaming concurrentJobs + StreamingJobProgressListener conflict
[ https://issues.apache.org/jira/browse/SPARK-21065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046649#comment-16046649 ] Dan Dutrow commented on SPARK-21065: Something to note: If one batch's processing time exceeds the batch interval, then a second batch could begin before the first is complete. This is fine behavior for us, and offers the ability to catch up if something gets delayed, but may be confusing for the scheduler. > Spark Streaming concurrentJobs + StreamingJobProgressListener conflict > -- > > Key: SPARK-21065 > URL: https://issues.apache.org/jira/browse/SPARK-21065 > Project: Spark > Issue Type: Bug > Components: DStreams, Scheduler >Affects Versions: 2.1.0 >Reporter: Dan Dutrow > > My streaming application has 200+ output operations, many of them stateful > and several of them windowed. In an attempt to reduce the processing times, I > set "spark.streaming.concurrentJobs" to 2+. Initial results are very > positive, cutting our processing time from ~3 minutes to ~1 minute, but > eventually we encounter an exception as follows: > Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a > batch from 45 minutes before the exception is thrown. 
> 2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR > org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener > StreamingJobProgressListener threw an exception > java.util.NoSuchElementException: key not found: 149697756 ms > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.mutable.HashMap.apply(HashMap.scala:65) > at > org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29) > at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43) > ... > The Spark code causing the exception is here: > https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC125 > override def onOutputOperationCompleted( > outputOperationCompleted: StreamingListenerOutputOperationCompleted): > Unit = synchronized { > // This method is called before onBatchCompleted > {color:red}runningBatchUIData(outputOperationCompleted.outputOperationInfo.batchTime).{color} > updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo) > } > It seems to me that it may be caused by that batch being removed earlier. 
> https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC102 > override def onBatchCompleted(batchCompleted: > StreamingListenerBatchCompleted): Unit = { > synchronized { > waitingBatchUIData.remove(batchCompleted.batchInfo.batchTime) > > {color:red}runningBatchUIData.remove(batchCompleted.batchInfo.batchTime){color} > val batchUIData = BatchUIData(batchCompleted.batchInfo) > completedBatchUIData.enqueue(batchUIData) > if (completedBatchUIData.size > batchUIDataLimit) { > val removedBatch = completedBatchUIData.dequeue() > batchTimeToOutputOpIdSparkJobIdPair.remove(removedBatch.batchTime) > } > totalCompletedBatches += 1L > totalProcessedRecords += batchUIData.numRecords > } > } > What is the solution here? Should I make my spark streaming context remember > duration a lot longer? ssc.remember(batchDuration * rememberMultiple) > Otherwise, it seems like there should be some kind of existence check on > runningBatchUIData before dereferencing it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
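The existence check suggested at the end of the report could look like the sketch below: look the batch up with `get` and skip the update if it has already been evicted. This is only an illustration of the defensive variant, not the fix adopted upstream, and it reuses the field and method names from the quoted listener code.

```scala
override def onOutputOperationCompleted(
    outputOperationCompleted: StreamingListenerOutputOperationCompleted): Unit = synchronized {
  val batchTime = outputOperationCompleted.outputOperationInfo.batchTime
  // Existence check: with spark.streaming.concurrentJobs > 1, onBatchCompleted
  // may have already removed this batch from runningBatchUIData, so a direct
  // apply() would throw NoSuchElementException. get(...).foreach skips quietly.
  runningBatchUIData.get(batchTime).foreach { batchUIData =>
    batchUIData.updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo)
  }
}
```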
[jira] [Updated] (SPARK-21065) Spark Streaming concurrentJobs + StreamingJobProgressListener conflict
[ https://issues.apache.org/jira/browse/SPARK-21065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dan Dutrow updated SPARK-21065: --- Component/s: DStreams > Spark Streaming concurrentJobs + StreamingJobProgressListener conflict > -- > > Key: SPARK-21065 > URL: https://issues.apache.org/jira/browse/SPARK-21065 > Project: Spark > Issue Type: Bug > Components: DStreams, Scheduler >Affects Versions: 2.1.0 >Reporter: Dan Dutrow > > My streaming application has 200+ output operations, many of them stateful > and several of them windowed. In an attempt to reduce the processing times, I > set "spark.streaming.concurrentJobs" to 2+. Initial results are very > positive, cutting our processing time from ~3 minutes to ~1 minute, but > eventually we encounter an exception as follows: > Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a > batch from 45 minutes before the exception is thrown. > 2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR > org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener > StreamingJobProgressListener threw an exception > java.util.NoSuchElementException: key not found 149697756 ms > at scala.collection.MalLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.mutable.HashMap.apply(HashMap.scala:65) > at > org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29) > at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43) > ... 
> The Spark code causing the exception is here: > https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC125 > override def onOutputOperationCompleted( > outputOperationCompleted: StreamingListenerOutputOperationCompleted): > Unit = synchronized { > // This method is called before onBatchCompleted > {color:red}runningBatchUIData(outputOperationCompleted.outputOperationInfo.batchTime).{color} > updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo) > } > It seems to me that it may be caused by that batch being removed earlier. > https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC102 > override def onBatchCompleted(batchCompleted: > StreamingListenerBatchCompleted): Unit = { > synchronized { > waitingBatchUIData.remove(batchCompleted.batchInfo.batchTime) > > {color:red}runningBatchUIData.remove(batchCompleted.batchInfo.batchTime){color} > val batchUIData = BatchUIData(batchCompleted.batchInfo) > completedBatchUIData.enqueue(batchUIData) > if (completedBatchUIData.size > batchUIDataLimit) { > val removedBatch = completedBatchUIData.dequeue() > batchTimeToOutputOpIdSparkJobIdPair.remove(removedBatch.batchTime) > } > totalCompletedBatches += 1L > totalProcessedRecords += batchUIData.numRecords > } > } > What is the solution here? Should I make my spark streaming context remember > duration a lot longer? ssc.remember(batchDuration * rememberMultiple) > Otherwise, it seems like there should be some kind of existence check on > runningBatchUIData before dereferencing it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21065) Spark Streaming concurrentJobs + StreamingJobProgressListener conflict
[ https://issues.apache.org/jira/browse/SPARK-21065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dan Dutrow updated SPARK-21065: --- Description: My streaming application has 200+ output operations, many of them stateful and several of them windowed. In an attempt to reduce the processing times, I set "spark.streaming.concurrentJobs" to 2+. Initial results are very positive, cutting our processing time from ~3 minutes to ~1 minute, but eventually we encounter an exception as follows: Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a batch from 45 minutes before the exception is thrown. 2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener StreamingJobProgressListener threw an exception java.util.NoSuchElementException: key not found 149697756 ms at scala.collection.MalLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:59) at scala.collection.mutable.HashMap.apply(HashMap.scala:65) at org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128) at org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67) at org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) at org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29) at org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43) ... 
The Spark code causing the exception is here: https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC125 override def onOutputOperationCompleted( outputOperationCompleted: StreamingListenerOutputOperationCompleted): Unit = synchronized { // This method is called before onBatchCompleted {color:red}runningBatchUIData(outputOperationCompleted.outputOperationInfo.batchTime).{color} updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo) } It seems to me that it may be caused by that batch being removed earlier. https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC102 override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = { synchronized { waitingBatchUIData.remove(batchCompleted.batchInfo.batchTime) {color:red}runningBatchUIData.remove(batchCompleted.batchInfo.batchTime){color} val batchUIData = BatchUIData(batchCompleted.batchInfo) completedBatchUIData.enqueue(batchUIData) if (completedBatchUIData.size > batchUIDataLimit) { val removedBatch = completedBatchUIData.dequeue() batchTimeToOutputOpIdSparkJobIdPair.remove(removedBatch.batchTime) } totalCompletedBatches += 1L totalProcessedRecords += batchUIData.numRecords } } What is the solution here? Should I make my spark streaming context remember duration a lot longer? ssc.remember(batchDuration * rememberMultiple) Otherwise, it seems like there should be some kind of existence check on runningBatchUIData before dereferencing it. was: My streaming application has 200+ output operations, many of them stateful and several of them windowed. In an attempt to reduce the processing times, I set "spark.streaming.concurrentJobs" to 2+. 
Initial results are very positive, cutting our processing time from ~3 minutes to ~1 minute, but eventually we encounter an exception as follows: Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a batch from 45 minutes before the exception is thrown. 2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener StreamingJobProgressListener threw an exception java.util.NoSuchElementException: key not found 149697756 ms at scala.collection.MalLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:59) at scala.collection.mutable.HashMap.apply(HashMap.scala:65) at org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128) at org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67) at org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) at org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29) at org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43) ... The Spark code
[jira] [Commented] (SPARK-21061) GMM Error : Matrix is not symmetric
[ https://issues.apache.org/jira/browse/SPARK-21061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046631#comment-16046631 ] Sean Owen commented on SPARK-21061: --- I get a slightly different error: NotConvergedException. The problem is that some inputs to eigSym are NaN. After tracing it back a while it's because math.exp(logpdf(x)) in pdf(x) becomes Infinite, and that's likely because sigma is near-singular, so rootSigmaInv has huge values. I stopped before figuring out exactly what's wrong there, but that's pretty much the root. For your example, I think it's because you have too many cluster centers for this data. Try k=5. It shouldn't cause an error, but, I think you'll find you don't want nearly so many clusters anyway. > GMM Error : Matrix is not symmetric > --- > > Key: SPARK-21061 > URL: https://issues.apache.org/jira/browse/SPARK-21061 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.0 > Environment: Ubuntu 14.04, Windows >Reporter: Sourav Mahato > > I am trying to implement Gaussian Mixture Model(GMM) using Apache Spark MLlib > (1.6.0). But I am getting the following error. > https://stackoverflow.com/questions/44494599/spark-mllib-gmm-error-matrix-is-not-symmetric > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
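Sean's suggestion of a much smaller k can be sketched with the RDD-based MLlib API the reporter is using (Spark 1.6). The `data` RDD is a placeholder for the points being clustered; this is an illustration, not a verified fix for the original dataset.

```scala
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors

// data: RDD[Vector] of the points being clustered (placeholder name).
// Fewer components keeps each covariance matrix better conditioned, so
// rootSigmaInv is less likely to blow up into Infinity/NaN during EM.
val gmm = new GaussianMixture()
  .setK(5)        // try k=5 instead of an over-large k
  .setSeed(42L)   // fixed seed for reproducibility
  .run(data)

// Inspect the fitted component means.
gmm.gaussians.foreach(g => println(g.mu))
```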
[jira] [Updated] (SPARK-21063) Spark return an empty result from remote hadoop cluster
[ https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bykov updated SPARK-21063: Affects Version/s: 2.1.0 > Spark return an empty result from remote hadoop cluster > --- > > Key: SPARK-21063 > URL: https://issues.apache.org/jira/browse/SPARK-21063 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.1.0, 2.1.1 >Reporter: Peter Bykov > > Spark returns an empty result when querying a remote Hadoop cluster. > All firewall settings have been removed. > Querying via JDBC works properly using the hive-jdbc driver, version 1.1.1. > The code snippet is: > {code:java} > val spark = SparkSession.builder > .appName("RemoteSparkTest") > .master("local") > .getOrCreate() > val df = spark.read > .option("url", "jdbc:hive2://remote.hive.local:1/default") > .option("user", "user") > .option("password", "pass") > .option("dbtable", "test_table") > .option("driver", "org.apache.hive.jdbc.HiveDriver") > .format("jdbc") > .load() > > df.show() > {code} > Result: > {noformat} > +---+ > |test_table.test_col| > +---+ > +---+ > {noformat} > All manipulations like: > {code:java} > df.select("*").show() > {code} > return an empty result too. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster
[ https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046596#comment-16046596 ] Peter Bykov commented on SPARK-21063: - [~sowen] The data is available in the table. I also get an empty result if I execute: {code:java} val df = spark.read .option("url", "jdbc:hive2://remote.hive.local:1/default") .option("user", "user") .option("password", "pass") .option("dbtable", "(SELECT 1 AS col) AS dft") .option("driver", "org.apache.hive.jdbc.HiveDriver") .format("jdbc") .load() {code} As you can see in my description, I receive the table schema but no data, which means the connection is OK. I'm pointing at the right cluster; I checked the link through a simple JDBC connection. Also, if I execute a query like: {code:java} val spark = SparkSession.builder .appName("RemoteSparkTest") .master("local") .config("hive.metastore.uris", "thrift://remote.hive.local:9083/default") .enableHiveSupport() .getOrCreate() val df = spark.sql("SELECT * FROM test_table") df.show() {code} Result: {noformat} ++ |test_col| ++ |test row| ++ {noformat} So, the data is available. > Spark return an empty result from remote hadoop cluster > --- > > Key: SPARK-21063 > URL: https://issues.apache.org/jira/browse/SPARK-21063 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.1.1 >Reporter: Peter Bykov > > Spark returns an empty result when querying a remote Hadoop cluster. > All firewall settings have been removed. 
> Querying via JDBC works properly using the hive-jdbc driver, version 1.1.1. > The code snippet is: > {code:java} > val spark = SparkSession.builder > .appName("RemoteSparkTest") > .master("local") > .getOrCreate() > val df = spark.read > .option("url", "jdbc:hive2://remote.hive.local:1/default") > .option("user", "user") > .option("password", "pass") > .option("dbtable", "test_table") > .option("driver", "org.apache.hive.jdbc.HiveDriver") > .format("jdbc") > .load() > > df.show() > {code} > Result: > {noformat} > +---+ > |test_table.test_col| > +---+ > +---+ > {noformat} > All manipulations like: > {code:java} > df.select("*").show() > {code} > return an empty result too. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21065) Spark Streaming concurrentJobs + StreamingJobProgressListener conflict
Dan Dutrow created SPARK-21065: -- Summary: Spark Streaming concurrentJobs + StreamingJobProgressListener conflict Key: SPARK-21065 URL: https://issues.apache.org/jira/browse/SPARK-21065 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.1.0 Reporter: Dan Dutrow My streaming application has 200+ output operations, many of them stateful and several of them windowed. In an attempt to reduce the processing times, I set "spark.streaming.concurrentJobs" to 2+. Initial results are very positive, cutting our processing time from ~3 minutes to ~1 minute, but eventually we encounter an exception as follows: Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a batch from 45 minutes before the exception is thrown. 2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener StreamingJobProgressListener threw an exception java.util.NoSuchElementException: key not found: 149697756 ms at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:59) at scala.collection.mutable.HashMap.apply(HashMap.scala:65) at org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128) at org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67) at org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) at org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29) at org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43) ... 
The Spark code causing the exception is here: https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC125 override def onOutputOperationCompleted( outputOperationCompleted: StreamingListenerOutputOperationCompleted): Unit = synchronized { // This method is called before onBatchCompleted {color:red}runningBatchUIData(outputOperationCompleted.outputOperationInfo.batchTime).{color} updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo) } It seems to me that it may be caused by that batch being removed earlier. https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC102 override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = { synchronized { waitingBatchUIData.remove(batchCompleted.batchInfo.batchTime) {color:red}runningBatchUIData.remove(batchCompleted.batchInfo.batchTime){color} val batchUIData = BatchUIData(batchCompleted.batchInfo) completedBatchUIData.enqueue(batchUIData) if (completedBatchUIData.size > batchUIDataLimit) { val removedBatch = completedBatchUIData.dequeue() batchTimeToOutputOpIdSparkJobIdPair.remove(removedBatch.batchTime) } totalCompletedBatches += 1L totalProcessedRecords += batchUIData.numRecords } } What is the solution here? Should I make my spark streaming context remember duration a lot longer? ssc.remember(batchDuration * rememberDuration) Otherwise, it seems like there should be some kind of existence check on runningBatchUIData before dereferencing it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster
[ https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046581#comment-16046581 ] Sean Owen commented on SPARK-21063: --- Is there data in the table? are you pointing at the right cluster? are there other errors? etc. > Spark return an empty result from remote hadoop cluster > --- > > Key: SPARK-21063 > URL: https://issues.apache.org/jira/browse/SPARK-21063 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.1.1 >Reporter: Peter Bykov > > Spark returning empty result from when querying remote hadoop cluster. > All firewall settings removed. > Querying using JDBC working properly using hive-jdbc driver from version 1.1.1 > Code snippet is: > {code:java} > val spark = SparkSession.builder > .appName("RemoteSparkTest") > .master("local") > .getOrCreate() > val df = spark.read > .option("url", "jdbc:hive2://remote.hive.local:1/default") > .option("user", "user") > .option("password", "pass") > .option("dbtable", "test_table") > .option("driver", "org.apache.hive.jdbc.HiveDriver") > .format("jdbc") > .load() > > df.show() > {code} > Result: > {noformat} > +---+ > |test_table.test_col| > +---+ > +---+ > {noformat} > All manipulations like: > {code:java} > df.select(*).show() > {code} > returns empty result too. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster
[ https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046576#comment-16046576 ] Peter Bykov commented on SPARK-21063: - [~srowen] what do you mean by wrong config? what configuration should i check? > Spark return an empty result from remote hadoop cluster > --- > > Key: SPARK-21063 > URL: https://issues.apache.org/jira/browse/SPARK-21063 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.1.1 >Reporter: Peter Bykov > > Spark returning empty result from when querying remote hadoop cluster. > All firewall settings removed. > Querying using JDBC working properly using hive-jdbc driver from version 1.1.1 > Code snippet is: > {code:java} > val spark = SparkSession.builder > .appName("RemoteSparkTest") > .master("local") > .getOrCreate() > val df = spark.read > .option("url", "jdbc:hive2://remote.hive.local:1/default") > .option("user", "user") > .option("password", "pass") > .option("dbtable", "test_table") > .option("driver", "org.apache.hive.jdbc.HiveDriver") > .format("jdbc") > .load() > > df.show() > {code} > Result: > {noformat} > +---+ > |test_table.test_col| > +---+ > +---+ > {noformat} > All manipulations like: > {code:java} > df.select(*).show() > {code} > returns empty result too. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21041) With whole-stage codegen, SparkSession.range()'s behavior is inconsistent with SparkContext.range()
[ https://issues.apache.org/jira/browse/SPARK-21041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21041. - Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.3.0 2.2.1 > With whole-stage codegen, SparkSession.range()'s behavior is inconsistent > with SparkContext.range() > --- > > Key: SPARK-21041 > URL: https://issues.apache.org/jira/browse/SPARK-21041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.1, 2.2.0 >Reporter: Kris Mok >Assignee: Dongjoon Hyun > Fix For: 2.2.1, 2.3.0 > > > When whole-stage codegen is enabled, in face of integer overflow, > SparkSession.range()'s behavior is inconsistent with when codegen is turned > off, while the latter is consistent with SparkContext.range()'s behavior. > The following Spark Shell session shows the inconsistency: > {code:java} > scala> sc.range >def range(start: Long,end: Long,step: Long,numSlices: Int): > org.apache.spark.rdd.RDD[Long] > scala> spark.range > > > def range(start: Long,end: Long,step: Long,numPartitions: Int): > org.apache.spark.sql.Dataset[Long] > def range(start: Long,end: Long,step: Long): > org.apache.spark.sql.Dataset[Long] > def range(start: Long,end: Long): org.apache.spark.sql.Dataset[Long] > > def range(end: Long): org.apache.spark.sql.Dataset[Long] > scala> sc.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, > 1).collect > res1: Array[Long] = Array() > scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + > 2, 1).collect > res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, > 9223372036854775806) > scala> spark.conf.set("spark.sql.codegen.wholeStage", false) > scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + > 2, 1).collect > res5: Array[Long] = Array() > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21064) Fix the default value bug in NettyBlockTransferServiceSuite
[ https://issues.apache.org/jira/browse/SPARK-21064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046493#comment-16046493 ] Apache Spark commented on SPARK-21064: -- User 'djvulee' has created a pull request for this issue: https://github.com/apache/spark/pull/18280 > Fix the default value bug in NettyBlockTransferServiceSuite > --- > > Key: SPARK-21064 > URL: https://issues.apache.org/jira/browse/SPARK-21064 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: DjvuLee >Assignee: DjvuLee >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21019) read orc when some of the columns are missing in some files
[ https://issues.apache.org/jira/browse/SPARK-21019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046485#comment-16046485 ] Mahesh commented on SPARK-21019: Can you include the exception in your defect? From what I know, Spark uses the ORC implementation hosted at http://orc.apache.org/ You may want to try a similar thing with that library. I will try to run your code. > read orc when some of the columns are missing in some files > > > Key: SPARK-21019 > URL: https://issues.apache.org/jira/browse/SPARK-21019 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.1 >Reporter: Modi Tamam > > I'm using Spark-2.1.1. > I'm experiencing an issue when reading a bunch of ORC files when some of > the fields are missing from some of the files (file-1 has fields 'a' and 'b', > file-2 has fields 'a' and 'c'). When I run the same flow with the JSON > file format, everything is just fine (you can see it in the code snippets, > if you run them...). My question is whether it's a bug or expected > behavior? > I've pushed a Maven project, ready to run; you can find it here: > https://github.com/MordechaiTamam/spark-orc-issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21064) Fix the default value bug in NettyBlockTransferServiceSuite
[ https://issues.apache.org/jira/browse/SPARK-21064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21064: - Assignee: DjvuLee Priority: Trivial (was: Major) Component/s: (was: Spark Core) Tests > Fix the default value bug in NettyBlockTransferServiceSuite > --- > > Key: SPARK-21064 > URL: https://issues.apache.org/jira/browse/SPARK-21064 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: DjvuLee >Assignee: DjvuLee >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21064) Fix the default value bug in NettyBlockTransferServiceSuite
[ https://issues.apache.org/jira/browse/SPARK-21064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046478#comment-16046478 ] Apache Spark commented on SPARK-21064: -- User 'djvulee' has created a pull request for this issue: https://github.com/apache/spark/pull/18279 > Fix the default value bug in NettyBlockTransferServiceSuite > --- > > Key: SPARK-21064 > URL: https://issues.apache.org/jira/browse/SPARK-21064 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: DjvuLee > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21064) Fix the default value bug in NettyBlockTransferServiceSuite
[ https://issues.apache.org/jira/browse/SPARK-21064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21064: Assignee: (was: Apache Spark) > Fix the default value bug in NettyBlockTransferServiceSuite > --- > > Key: SPARK-21064 > URL: https://issues.apache.org/jira/browse/SPARK-21064 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: DjvuLee > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21064) Fix the default value bug in NettyBlockTransferServiceSuite
[ https://issues.apache.org/jira/browse/SPARK-21064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21064: Assignee: Apache Spark > Fix the default value bug in NettyBlockTransferServiceSuite > --- > > Key: SPARK-21064 > URL: https://issues.apache.org/jira/browse/SPARK-21064 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: DjvuLee >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20927) Add cache operator to Unsupported Operations in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046476#comment-16046476 ] Chenzhao Guo commented on SPARK-20927: -- What exactly is 'no-op' ? Does that mean scala `NotImplementedError`, or could you give an example? > Add cache operator to Unsupported Operations in Structured Streaming > - > > Key: SPARK-20927 > URL: https://issues.apache.org/jira/browse/SPARK-20927 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > Labels: starter > > Just [found > out|https://stackoverflow.com/questions/42062092/why-does-using-cache-on-streaming-datasets-fail-with-analysisexception-queries] > that {{cache}} is not allowed on streaming datasets. > {{cache}} on streaming datasets leads to the following exception: > {code} > scala> spark.readStream.text("files").cache > org.apache.spark.sql.AnalysisException: Queries with streaming sources must > be executed with writeStream.start();; > FileSource[files] > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34) > at > org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63) > at > org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74) > at > 
org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72) > at > org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78) > at > org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89) > at > org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:104) > at > org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:68) > at > org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:92) > at org.apache.spark.sql.Dataset.persist(Dataset.scala:2603) > at org.apache.spark.sql.Dataset.cache(Dataset.scala:2613) > ... 48 elided > {code} > It should be included in Structured Streaming's [Unsupported > Operations|http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations]. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21064) Fix the default value bug in NettyBlockTransferServiceSuite
[ https://issues.apache.org/jira/browse/SPARK-21064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046475#comment-16046475 ] DjvuLee commented on SPARK-21064: - The default value of `spark.port.maxRetries` is 100, but the suite uses 10. > Fix the default value bug in NettyBlockTransferServiceSuite > --- > > Key: SPARK-21064 > URL: https://issues.apache.org/jira/browse/SPARK-21064 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: DjvuLee > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21064) Fix the default value bug in NettyBlockTransferServiceSuite
DjvuLee created SPARK-21064: --- Summary: Fix the default value bug in NettyBlockTransferServiceSuite Key: SPARK-21064 URL: https://issues.apache.org/jira/browse/SPARK-21064 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: DjvuLee -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21058) potential SVD optimization
[ https://issues.apache.org/jira/browse/SPARK-21058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046424#comment-16046424 ] Sean Owen commented on SPARK-21058: --- I think we discussed this separately. The Gramian method is indeed appropriate for tall, skinny matrices. Just because computing the Gramian is the hotspot doesn't mean there's a faster way. How would you efficiently compute the SVD of any huge matrix? If the matrix isn't huge, you can pull it into memory and decompose it with a library, sure. But that's not a Spark problem. There's also a different SVD implementation in GraphX. No, I would not start work on anything until you've addressed the above. > potential SVD optimization > -- > > Key: SPARK-21058 > URL: https://issues.apache.org/jira/browse/SPARK-21058 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.1 >Reporter: Vincent > > In the current implementation, computeSVD computes the SVD of a matrix A by > first computing A^T*A and then running SVD on that Gramian matrix. We found that the Gramian > matrix computation is the hot spot of the overall SVD computation. While SVD > on the Gramian can speed up SVD of a skinny matrix, for a > non-skinny matrix it can become a huge overhead. So, is it possible > to offer another option that computes SVD on the original matrix instead of > the Gramian matrix? We could decide which way to go by the ratio between > numCols and numRows, or simply by a user setting. > We have observed a sizable gain on a toy dataset from SVD on the original > matrix instead of the Gramian matrix; if the proposal is acceptable, we will > start to work on the patch and gather more performance data. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
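The Gramian route under discussion can be sketched with a tiny hand-checkable example. This is pure Python and not Spark's implementation: the singular values of A are the square roots of the eigenvalues of the Gramian A^T*A. For a tall-skinny A the Gramian is a small n x n matrix, which is why this route suits skinny matrices; for wide matrices building the Gramian itself becomes the overhead the reporter describes.

```python
import math

def gramian(A):
    """Compute A^T A for a matrix given as a list of rows."""
    n = len(A[0])
    return [[sum(row[i] * row[j] for row in A) for j in range(n)]
            for i in range(n)]

# A 3x2 matrix with orthogonal columns and known singular values 4 and 3.
A = [[3.0, 0.0],
     [0.0, 4.0],
     [0.0, 0.0]]

G = gramian(A)  # [[9.0, 0.0], [0.0, 16.0]]
# G is diagonal here, so its eigenvalues are just its diagonal entries;
# the singular values of A are their square roots, in descending order.
singular_values = sorted((math.sqrt(G[i][i]) for i in range(2)), reverse=True)
assert singular_values == [4.0, 3.0]
```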
[jira] [Assigned] (SPARK-20947) Encoding/decoding issue in PySpark pipe implementation
[ https://issues.apache.org/jira/browse/SPARK-20947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20947: Assignee: Apache Spark > Encoding/decoding issue in PySpark pipe implementation > -- > > Key: SPARK-20947 > URL: https://issues.apache.org/jira/browse/SPARK-20947 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, > 2.1.1 >Reporter: Xiaozhe Wang >Assignee: Apache Spark > Attachments: fix-pipe-encoding-error.patch > > > The pipe action converts objects into strings in a way that is affected by the > default encoding setting of the Python environment. > Here is the related code fragment (L717-721@python/pyspark/rdd.py): > {code} > def pipe_objs(out): > for obj in iterator: > s = str(obj).rstrip('\n') + '\n' > out.write(s.encode('utf-8')) > out.close() > {code} > The `str(obj)` part implicitly converts `obj` to a unicode string, then > encodes it into a byte string using the default encoding; on the other hand, the > `s.encode('utf-8')` part implicitly decodes `s` into a unicode string using > the default encoding and then encodes it (AGAIN!) into a UTF-8 encoded byte string. > Typically the default encoding of a Python environment is 'ascii', which > means passing a unicode string containing characters beyond the 'ascii' charset > will raise a UnicodeEncodeError exception at `str(obj)`, and passing a byte > string containing bytes greater than 128 will raise a UnicodeDecodeError > exception at `s.encode('utf-8')`. > Changing `str(obj)` to `unicode(obj)` would eliminate these problems. > The following code snippet reproduces these errors: > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 1.6.3 > /_/ > Using Python version 2.7.12 (default, Jul 25 2016 15:06:45) > SparkContext available as sc, HiveContext available as sqlContext. 
> >>> sc.parallelize([u'\u6d4b\u8bd5']).pipe('cat').collect() > [Stage 0:> (0 + 4) / > 4]Exception in thread Thread-1: > Traceback (most recent call last): > File > "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", > line 801, in __bootstrap_inner > self.run() > File > "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", > line 754, in run > self.__target(*self.__args, **self.__kwargs) > File > "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line > 719, in pipe_objs > s = str(obj).rstrip('\n') + '\n' > UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: > ordinal not in range(128) > >>> > >>> sc.parallelize([u'\u6d4b\u8bd5']).map(lambda x: > >>> x.encode('utf-8')).pipe('cat').collect() > Exception in thread Thread-1: > Traceback (most recent call last): > File > "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", > line 801, in __bootstrap_inner > self.run() > File > "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", > line 754, in run > self.__target(*self.__args, **self.__kwargs) > File > "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line > 720, in pipe_objs > out.write(s.encode('utf-8')) > UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: > ordinal not in range(128) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20947) Encoding/decoding issue in PySpark pipe implementation
[ https://issues.apache.org/jira/browse/SPARK-20947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046420#comment-16046420 ] Apache Spark commented on SPARK-20947: -- User 'chaoslawful' has created a pull request for this issue: https://github.com/apache/spark/pull/18277 > Encoding/decoding issue in PySpark pipe implementation > -- > > Key: SPARK-20947 > URL: https://issues.apache.org/jira/browse/SPARK-20947 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, > 2.1.1 >Reporter: Xiaozhe Wang > Attachments: fix-pipe-encoding-error.patch > > > The pipe action converts objects into strings in a way that is affected by the > default encoding setting of the Python environment. > Here is the related code fragment (L717-721@python/pyspark/rdd.py): > {code} > def pipe_objs(out): > for obj in iterator: > s = str(obj).rstrip('\n') + '\n' > out.write(s.encode('utf-8')) > out.close() > {code} > The `str(obj)` part implicitly converts `obj` to a unicode string, then > encodes it into a byte string using the default encoding; on the other hand, the > `s.encode('utf-8')` part implicitly decodes `s` into a unicode string using > the default encoding and then encodes it (AGAIN!) into a UTF-8 encoded byte string. > Typically the default encoding of a Python environment is 'ascii', which > means passing a unicode string containing characters beyond the 'ascii' charset > will raise a UnicodeEncodeError exception at `str(obj)`, and passing a byte > string containing bytes greater than 128 will raise a UnicodeDecodeError > exception at `s.encode('utf-8')`. > Changing `str(obj)` to `unicode(obj)` would eliminate these problems. 
> The following code snippet reproduces these errors: > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 1.6.3 > /_/ > Using Python version 2.7.12 (default, Jul 25 2016 15:06:45) > SparkContext available as sc, HiveContext available as sqlContext. > >>> sc.parallelize([u'\u6d4b\u8bd5']).pipe('cat').collect() > [Stage 0:> (0 + 4) / > 4]Exception in thread Thread-1: > Traceback (most recent call last): > File > "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", > line 801, in __bootstrap_inner > self.run() > File > "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", > line 754, in run > self.__target(*self.__args, **self.__kwargs) > File > "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line > 719, in pipe_objs > s = str(obj).rstrip('\n') + '\n' > UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: > ordinal not in range(128) > >>> > >>> sc.parallelize([u'\u6d4b\u8bd5']).map(lambda x: > >>> x.encode('utf-8')).pipe('cat').collect() > Exception in thread Thread-1: > Traceback (most recent call last): > File > "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", > line 801, in __bootstrap_inner > self.run() > File > "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", > line 754, in run > self.__target(*self.__args, **self.__kwargs) > File > "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line > 720, in pipe_objs > out.write(s.encode('utf-8')) > UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: > ordinal not in range(128) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20947) Encoding/decoding issue in PySpark pipe implementation
[ https://issues.apache.org/jira/browse/SPARK-20947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20947: Assignee: (was: Apache Spark) > Encoding/decoding issue in PySpark pipe implementation > -- > > Key: SPARK-20947 > URL: https://issues.apache.org/jira/browse/SPARK-20947 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, > 2.1.1 >Reporter: Xiaozhe Wang > Attachments: fix-pipe-encoding-error.patch > > > The pipe action converts objects into strings in a way that is affected by the > default encoding setting of the Python environment. > Here is the related code fragment (L717-721@python/pyspark/rdd.py): > {code} > def pipe_objs(out): > for obj in iterator: > s = str(obj).rstrip('\n') + '\n' > out.write(s.encode('utf-8')) > out.close() > {code} > The `str(obj)` part implicitly converts `obj` to a unicode string, then > encodes it into a byte string using the default encoding; on the other hand, the > `s.encode('utf-8')` part implicitly decodes `s` into a unicode string using > the default encoding and then encodes it (AGAIN!) into a UTF-8 encoded byte string. > Typically the default encoding of a Python environment is 'ascii', which > means passing a unicode string containing characters beyond the 'ascii' charset > will raise a UnicodeEncodeError exception at `str(obj)`, and passing a byte > string containing bytes greater than 128 will raise a UnicodeDecodeError > exception at `s.encode('utf-8')`. > Changing `str(obj)` to `unicode(obj)` would eliminate these problems. > The following code snippet reproduces these errors: > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 1.6.3 > /_/ > Using Python version 2.7.12 (default, Jul 25 2016 15:06:45) > SparkContext available as sc, HiveContext available as sqlContext. 
> >>> sc.parallelize([u'\u6d4b\u8bd5']).pipe('cat').collect() > [Stage 0:> (0 + 4) / > 4]Exception in thread Thread-1: > Traceback (most recent call last): > File > "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", > line 801, in __bootstrap_inner > self.run() > File > "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", > line 754, in run > self.__target(*self.__args, **self.__kwargs) > File > "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line > 719, in pipe_objs > s = str(obj).rstrip('\n') + '\n' > UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: > ordinal not in range(128) > >>> > >>> sc.parallelize([u'\u6d4b\u8bd5']).map(lambda x: > >>> x.encode('utf-8')).pipe('cat').collect() > Exception in thread Thread-1: > Traceback (most recent call last): > File > "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", > line 801, in __bootstrap_inner > self.run() > File > "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", > line 754, in run > self.__target(*self.__args, **self.__kwargs) > File > "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line > 720, in pipe_objs > out.write(s.encode('utf-8')) > UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: > ordinal not in range(128) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
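The fix pattern the reporter proposes can be sketched in modern Python. This is a hedged Python 3 illustration, not the actual patch: perform exactly one text conversion and one explicit UTF-8 encode, instead of the `str()` + `encode()` pair that, under Python 2's default 'ascii' codec, converts twice. (In Python 3, `str` is already unicode, so the double-conversion trap disappears; the `pipe_objs` name mirrors the rdd.py fragment above.)

```python
import io

class KeepOpenBuffer(io.BytesIO):
    def close(self):
        # Keep the buffer's contents readable after pipe_objs closes it,
        # purely for this demo.
        pass

def pipe_objs(out, iterator):
    for obj in iterator:
        s = str(obj).rstrip('\n') + '\n'  # one text conversion (unicode(obj) in py2)
        out.write(s.encode('utf-8'))      # one explicit UTF-8 encode
    out.close()

buf = KeepOpenBuffer()
pipe_objs(buf, ['\u6d4b\u8bd5', 'plain'])  # the CJK sample from the ticket
assert buf.getvalue() == '\u6d4b\u8bd5\nplain\n'.encode('utf-8')
```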
[jira] [Assigned] (SPARK-21057) Do not use a PascalDistribution in countApprox
[ https://issues.apache.org/jira/browse/SPARK-21057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21057: Assignee: (was: Apache Spark) > Do not use a PascalDistribution in countApprox > -- > > Key: SPARK-21057 > URL: https://issues.apache.org/jira/browse/SPARK-21057 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Lovasoa > > I was reading the source of Spark, and found this: > https://github.com/apache/spark/blob/v2.1.1/core/src/main/scala/org/apache/spark/partial/CountEvaluator.scala#L50-L72 > This is the function that estimates the probability distribution of the total > count of elements in an RDD given the count of only some partitions. > This function does a strange thing: when the number of elements counted so > far is less than 10 000, it models the total count with a negative binomial > (Pascal) law, else, it models it with a Poisson law. > Modeling our number of uncounted elements with a negative binomial law is > like saying that we ran over elements, counting only some, and stopping after > having counted a given number of elements. > But this does not model what really happened. Our counting was limited in > time, not in number of counted elements, and we can't count only some of the > elements in a partition. > I propose to use the Poisson distribution in every case, as it can be > justified under the hypothesis that the number of elements in each partition > is independent and follows a Poisson law. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21057) Do not use a PascalDistribution in countApprox
[ https://issues.apache.org/jira/browse/SPARK-21057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21057: Assignee: Apache Spark > Do not use a PascalDistribution in countApprox > -- > > Key: SPARK-21057 > URL: https://issues.apache.org/jira/browse/SPARK-21057 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Lovasoa >Assignee: Apache Spark > > I was reading the source of Spark, and found this: > https://github.com/apache/spark/blob/v2.1.1/core/src/main/scala/org/apache/spark/partial/CountEvaluator.scala#L50-L72 > This is the function that estimates the probability distribution of the total > count of elements in an RDD given the count of only some partitions. > This function does a strange thing: when the number of elements counted so > far is less than 10 000, it models the total count with a negative binomial > (Pascal) law, else, it models it with a Poisson law. > Modeling our number of uncounted elements with a negative binomial law is > like saying that we ran over elements, counting only some, and stopping after > having counted a given number of elements. > But this does not model what really happened. Our counting was limited in > time, not in number of counted elements, and we can't count only some of the > elements in a partition. > I propose to use the Poisson distribution in every case, as it can be > justified under the hypothesis that the number of elements in each partition > is independent and follows a Poisson law. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21057) Do not use a PascalDistribution in countApprox
[ https://issues.apache.org/jira/browse/SPARK-21057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046402#comment-16046402 ] Apache Spark commented on SPARK-21057: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/18276 > Do not use a PascalDistribution in countApprox > -- > > Key: SPARK-21057 > URL: https://issues.apache.org/jira/browse/SPARK-21057 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Lovasoa > > I was reading the source of Spark, and found this: > https://github.com/apache/spark/blob/v2.1.1/core/src/main/scala/org/apache/spark/partial/CountEvaluator.scala#L50-L72 > This is the function that estimates the probability distribution of the total > count of elements in an RDD given the count of only some partitions. > This function does a strange thing: when the number of elements counted so > far is less than 10 000, it models the total count with a negative binomial > (Pascal) law, else, it models it with a Poisson law. > Modeling our number of uncounted elements with a negative binomial law is > like saying that we ran over elements, counting only some, and stopping after > having counted a given number of elements. > But this does not model what really happened. Our counting was limited in > time, not in number of counted elements, and we can't count only some of the > elements in a partition. > I propose to use the Poisson distribution in every case, as it can be > justified under the hypothesis that the number of elements in each partition > is independent and follows a Poisson law. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
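The Poisson model the reporter argues for can be sketched with the standard library. This is a hedged illustration, not Spark's CountEvaluator: after counting c elements over a fraction p of the partitions, model the total count as Poisson with mean c / p. The normal approximation to the Poisson (same mean and variance) used below for the 95% interval is an assumption of this sketch, reasonable for large means.

```python
import math

def approx_count_poisson(c, p):
    """Return (estimate, low, high) for the total count at 95% confidence,
    modeling the total as Poisson with mean c / p."""
    mean = c / p
    z = 1.96                           # 0.975 quantile of the standard normal
    half_width = z * math.sqrt(mean)   # Poisson variance equals its mean
    return mean, max(0.0, mean - half_width), mean + half_width

est, low, high = approx_count_poisson(c=10_000, p=0.1)
# est is 100_000; the interval is centered on it and widens as p shrinks,
# reflecting greater uncertainty when fewer partitions have been counted.
assert low < est < high
```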
[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster
[ https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046395#comment-16046395 ] Sean Owen commented on SPARK-21063: --- Nothing about this suggests a Spark problem. You may not have the data you think; you may have the wrong config. This works fine on my cluster and my data. > Spark return an empty result from remote hadoop cluster > --- > > Key: SPARK-21063 > URL: https://issues.apache.org/jira/browse/SPARK-21063 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.1.1 >Reporter: Peter Bykov > > Spark returns an empty result when querying a remote Hadoop cluster. > All firewall settings have been removed. > Querying over JDBC works properly using the hive-jdbc driver from version 1.1.1. > The code snippet is: > {code:java} > val spark = SparkSession.builder > .appName("RemoteSparkTest") > .master("local") > .getOrCreate() > val df = spark.read > .option("url", "jdbc:hive2://remote.hive.local:1/default") > .option("user", "user") > .option("password", "pass") > .option("dbtable", "test_table") > .option("driver", "org.apache.hive.jdbc.HiveDriver") > .format("jdbc") > .load() > > df.show() > {code} > Result: > {noformat} > +---+ > |test_table.test_col| > +---+ > +---+ > {noformat} > All manipulations like: > {code:java} > df.select(*).show() > {code} > return an empty result too. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org