[jira] [Updated] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15689: Labels: releasenotes (was: ) > Data source API v2 > -- > > Key: SPARK-15689 > URL: https://issues.apache.org/jira/browse/SPARK-15689 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Labels: releasenotes > > This ticket tracks progress in creating v2 of the data source API. This new > API should focus on: > 1. Have a small surface so it is easy to freeze and maintain compatibility > for a long time. Ideally, this API should survive architectural rewrites and > user-facing API revamps of Spark. > 2. Have a well-defined column batch interface for high performance. > Convenience methods should exist to convert row-oriented formats into column > batches for data source developers. > 3. Still support filter push down, similar to the existing API. > 4. Nice-to-have: support additional common operators, including limit and > sampling. > Note that both 1 and 2 are problems that the current data source API (v1) > suffers from. The current data source API has a wide surface with a dependency on > DataFrame/SQLContext, making data source API compatibility dependent on > the upper-level API. The current data source API is also row oriented only > and has to go through an expensive conversion from the external data type to > the internal data type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
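A purely illustrative Scala sketch of the kind of narrow, batch-oriented read interface the ticket above describes; every name here is invented for illustration and none of it is the actual proposed API:

{code}
// Illustrative only: a deliberately small reader surface with filter push-down
// and a columnar output, mirroring goals 1-3 of the ticket. All names are made up.
trait ColumnarBatchSketch                              // stand-in for a column batch abstraction
case class EqualToSketch(column: String, value: Any)   // stand-in for a pushed-down filter

trait DataSourceReaderV2Sketch {
  // Accept candidate filters and return the ones the source cannot handle itself.
  def pushFilters(filters: Seq[EqualToSketch]): Seq[EqualToSketch]
  // Expose data as column batches rather than rows, for performance.
  def readBatches(): Iterator[ColumnarBatchSketch]
}
{code}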
[jira] [Updated] (SPARK-18350) Support session local timezone
[ https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18350: Labels: releasenotes (was: ) > Support session local timezone > -- > > Key: SPARK-18350 > URL: https://issues.apache.org/jira/browse/SPARK-18350 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Labels: releasenotes > > As of Spark 2.1, Spark SQL assumes the machine timezone for datetime > manipulation, which is bad if users are not in the same timezones as the > machines, or if different users have different timezones. > We should introduce a session local timezone setting that is used for > execution. > An explicit non-goal is locale handling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
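A minimal sketch of how such a session-local setting could look from the user side, assuming a session-level SQL conf key along the lines of {{spark.sql.session.timeZone}} (the exact key name and semantics are still open in this ticket) and a SparkSession named {{spark}}:

{code}
// Sketch only: make datetime functions use a session-local timezone instead of
// the machine timezone. The conf key name here is an assumption, not a released setting.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.sql("SELECT from_unixtime(0) AS epoch_in_session_tz").show()
{code}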
[jira] [Updated] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18352: Labels: releasenotes (was: ) > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Labels: releasenotes > > Spark currently can only parse JSON files that are JSON Lines, i.e. each > record occupies a single line and records are separated by newlines. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, but rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
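A sketch of how the proposed mode might be used, keeping the ticket's tentative option name {{wholeJsonFile}} (hypothetical; the final name is still undecided) and assuming a SparkSession named {{spark}}:

{code}
// Sketch only: read JSON files in which a single record spans multiple lines,
// instead of requiring JSON Lines input. "wholeJsonFile" is the ticket's tentative name.
val df = spark.read
  .option("wholeJsonFile", "true")
  .json("/data/pretty_printed_json/")
df.printSchema()
{code}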
[jira] [Updated] (SPARK-16475) Broadcast Hint for SQL Queries
[ https://issues.apache.org/jira/browse/SPARK-16475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16475: Labels: releasenotes (was: ) > Broadcast Hint for SQL Queries > -- > > Key: SPARK-16475 > URL: https://issues.apache.org/jira/browse/SPARK-16475 > Project: Spark > Issue Type: Improvement >Reporter: Reynold Xin > Labels: releasenotes > Attachments: BroadcastHintinSparkSQL.pdf > > > A broadcast hint is a way for users to manually annotate a query and suggest a > join method to the query optimizer. It is very useful when the query > optimizer cannot make an optimal decision with respect to join methods due to > conservativeness or a lack of proper statistics. > The DataFrame API has had a broadcast hint since Spark 1.5. However, we do not have > equivalent functionality in SQL queries. We propose adding a Hive-style > broadcast hint to Spark SQL. > For more information, please see the attached document. One note about the > doc: in addition to supporting "MAPJOIN", we should also support > "BROADCASTJOIN" and "BROADCAST" in the hint comment, e.g. the following should be > accepted: > {code} > SELECT /*+ MAPJOIN(b) */ ... > SELECT /*+ BROADCASTJOIN(b) */ ... > SELECT /*+ BROADCAST(b) */ ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
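For reference, the existing DataFrame-side hint mentioned in the ticket above looks roughly like this ({{largeDf}} and {{smallDf}} are made-up names; the SQL comment hints proposed here would be the SQL-side equivalent):

{code}
import org.apache.spark.sql.functions.broadcast

// DataFrame API broadcast hint, available since Spark 1.5: mark the smaller side
// of the join so the optimizer plans a broadcast join.
val joined = largeDf.join(broadcast(smallDf), "id")
{code}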
[jira] [Updated] (SPARK-17897) not isnotnull is converted to the always false condition isnotnull && not isnotnull
[ https://issues.apache.org/jira/browse/SPARK-17897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-17897: Labels: correctness (was: ) > not isnotnull is converted to the always false condition isnotnull && not > isnotnull > --- > > Key: SPARK-17897 > URL: https://issues.apache.org/jira/browse/SPARK-17897 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.0.0, 2.0.1 >Reporter: Jordan Halterman > Labels: correctness > > When a logical plan is built containing the following somewhat nonsensical > filter: > {{Filter (NOT isnotnull($f0#212))}} > During optimization the filter is converted into a condition that will always > fail: > {{Filter (isnotnull($f0#212) && NOT isnotnull($f0#212))}} > This appears to be caused by the following check for {{NullIntolerant}}: > https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R63 > Which recurses through the expression and extracts nested {{IsNotNull}} > calls, converting them to {{IsNotNull}} calls on the attribute at the root > level: > https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R49 > This results in the nonsensical condition above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
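A minimal Scala sketch of a reproduction along the lines of this report, assuming a nullable column named {{f0}} and a SparkSession named {{spark}}; on the affected versions the optimized plan reportedly contains the contradictory filter described above:

{code}
// Sketch only: build a plan containing Filter(NOT isnotnull(f0)) and inspect the optimized plan.
import spark.implicits._

val df = Seq(Some(1), None).toDF("f0")
df.filter(!$"f0".isNotNull).explain(true)
// Per this report, the optimized plan on 2.0.0/2.0.1 shows
//   Filter (isnotnull(f0) && NOT isnotnull(f0))
// which can never be satisfied.
{code}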
[jira] [Commented] (SPARK-17897) not isnotnull is converted to the always false condition isnotnull && not isnotnull
[ https://issues.apache.org/jira/browse/SPARK-17897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15704547#comment-15704547 ] Xiao Li commented on SPARK-17897: - Actually, the fix is super simple. Just one line. > not isnotnull is converted to the always false condition isnotnull && not > isnotnull > --- > > Key: SPARK-17897 > URL: https://issues.apache.org/jira/browse/SPARK-17897 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.0.0, 2.0.1 >Reporter: Jordan Halterman > > When a logical plan is built containing the following somewhat nonsensical > filter: > {{Filter (NOT isnotnull($f0#212))}} > During optimization the filter is converted into a condition that will always > fail: > {{Filter (isnotnull($f0#212) && NOT isnotnull($f0#212))}} > This appears to be caused by the following check for {{NullIntolerant}}: > https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R63 > Which recurses through the expression and extracts nested {{IsNotNull}} > calls, converting them to {{IsNotNull}} calls on the attribute at the root > level: > https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R49 > This results in the nonsensical condition above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18618) SparkR model predict should support type as an argument
[ https://issues.apache.org/jira/browse/SPARK-18618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15704522#comment-15704522 ] Yanbo Liang commented on SPARK-18618: - cc [~josephkb] [~felixcheung] > SparkR model predict should support type as an argument > -- > > Key: SPARK-18618 > URL: https://issues.apache.org/jira/browse/SPARK-18618 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Yanbo Liang > > SparkR model {{predict}} should support {{type}} as an argument. This will make it > consistent with native R predict such as > https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18618) SparkR model predict should support type as an argument
Yanbo Liang created SPARK-18618: --- Summary: SparkR model predict should support type as an argument Key: SPARK-18618 URL: https://issues.apache.org/jira/browse/SPARK-18618 Project: Spark Issue Type: Improvement Components: ML, SparkR Reporter: Yanbo Liang SparkR model {{predict}} should support {{type}} as an argument. This will make it consistent with native R predict such as https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18587) Remove handleInvalid from QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-18587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang closed SPARK-18587. --- Resolution: Won't Fix > Remove handleInvalid from QuantileDiscretizer > - > > Key: SPARK-18587 > URL: https://issues.apache.org/jira/browse/SPARK-18587 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Critical > > {{handleInvalid}} only comes into play when {{Bucketizer}} is transforming a dataset which > contains NaN values; however, when the training dataset contains NaN values, > {{QuantileDiscretizer}} will always ignore them. So we should keep > {{handleInvalid}} in {{Bucketizer}} and remove it from > {{QuantileDiscretizer}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
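For context, a small Scala sketch of keeping the parameter on {{Bucketizer}} as this ticket proposes; the {{"skip"}} value and the splits below are illustrative assumptions, not something specified by the ticket:

{code}
import org.apache.spark.ml.feature.Bucketizer

// Sketch only: handleInvalid stays on Bucketizer (it matters at transform time),
// while QuantileDiscretizer simply ignores NaN values when fitting.
val bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("bucket")
  .setSplits(Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity))
  .setHandleInvalid("skip")  // assumed value: skip rows whose feature is NaN
{code}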
[jira] [Commented] (SPARK-16986) "Started" time, "Completed" time and "Last Updated" time in history server UI are not user local time
[ https://issues.apache.org/jira/browse/SPARK-16986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15704507#comment-15704507 ] Denis Bolshakov commented on SPARK-16986: - [~srowen], could you please specify why the ticket was closed? I understand why the PR was rejected, but in fact, the issue still exists (spark 2.0.2). Kind regards, Denis > "Started" time, "Completed" time and "Last Updated" time in history server UI > are not user local time > - > > Key: SPARK-16986 > URL: https://issues.apache.org/jira/browse/SPARK-16986 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Weiqing Yang >Priority: Minor > > Currently, "Started" time, "Completed" time and "Last Updated" time in > history server UI are GMT. They should be the user local time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17897) not isnotnull is converted to the always false condition isnotnull && not isnotnull
[ https://issues.apache.org/jira/browse/SPARK-17897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15704486#comment-15704486 ] Xiao Li commented on SPARK-17897: - I can reproduce it. Will fix it tomorrow. Thanks for reporting this! > not isnotnull is converted to the always false condition isnotnull && not > isnotnull > --- > > Key: SPARK-17897 > URL: https://issues.apache.org/jira/browse/SPARK-17897 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.0.0, 2.0.1 >Reporter: Jordan Halterman > > When a logical plan is built containing the following somewhat nonsensical > filter: > {{Filter (NOT isnotnull($f0#212))}} > During optimization the filter is converted into a condition that will always > fail: > {{Filter (isnotnull($f0#212) && NOT isnotnull($f0#212))}} > This appears to be caused by the following check for {{NullIntolerant}}: > https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R63 > Which recurses through the expression and extracts nested {{IsNotNull}} > calls, converting them to {{IsNotNull}} calls on the attribute at the root > level: > https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R49 > This results in the nonsensical condition above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18616) Pure Python Implementation of MLWritable for use in Pipeline
[ https://issues.apache.org/jira/browse/SPARK-18616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15704472#comment-15704472 ] Nick Pentreath commented on SPARK-18616: Just a note that generally committers set Target Version. Thanks! > Pure Python Implementation of MLWritable for use in Pipeline > > > Key: SPARK-18616 > URL: https://issues.apache.org/jira/browse/SPARK-18616 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 > Environment: pyspark >Reporter: Andrea Matsunaga > > When developing an estimator and model completely in Python, it is possible > to implement the save() function, and it works for a standalone model, but > not when added to a Pipeline. The reason is that the Pipeline save implementation > forces the use of JavaMLWritable, thus also requiring the object to have > methods that are meaningful only to Java objects. The Pipeline implementation > needs to check the type of the writable object that is defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18616) Pure Python Implementation of MLWritable for use in Pipeline
[ https://issues.apache.org/jira/browse/SPARK-18616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18616: --- Target Version/s: (was: 2.0.2) > Pure Python Implementation of MLWritable for use in Pipeline > > > Key: SPARK-18616 > URL: https://issues.apache.org/jira/browse/SPARK-18616 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 > Environment: pyspark >Reporter: Andrea Matsunaga > > When developing an estimator and model completely in Python, it is possible > to implement the save() function, and it works for a standalone model, but > not when added to a Pipeline. The reason is that the Pipeline save implementation > forces the use of JavaMLWritable, thus also requiring the object to have > methods that are meaningful only to Java objects. The Pipeline implementation > needs to check the type of the writable object that is defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18339) Don't push down current_timestamp for filters in StructuredStreaming
[ https://issues.apache.org/jira/browse/SPARK-18339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-18339. --- Resolution: Fixed Fix Version/s: 2.1.0 > Don't push down current_timestamp for filters in StructuredStreaming > > > Key: SPARK-18339 > URL: https://issues.apache.org/jira/browse/SPARK-18339 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.1 >Reporter: Burak Yavuz >Assignee: Tyson Condie >Priority: Critical > Fix For: 2.1.0 > > > For the following workflow: > 1. I have a column called time which is at minute-level precision in a > streaming DataFrame > 2. I want to perform groupBy time, count > 3. Then I want my MemorySink to only have the last 30 minutes of counts and I > perform this by > {code} > .where('time >= current_timestamp().cast("long") - 30 * 60) > {code} > What happens is that the `filter` gets pushed down before the aggregation, > and the filter happens on the source data for the aggregation instead of the > result of the aggregation (where I actually want to filter). > I guess the main issue here is that `current_timestamp` is non-deterministic > in the streaming context, so the filter containing it shouldn't be pushed down. > Whether this requires us to store the `current_timestamp` for each trigger of the > streaming job is something to discuss. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18513) Record and recover watermark
[ https://issues.apache.org/jira/browse/SPARK-18513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-18513. --- Resolution: Fixed Fix Version/s: 2.1.0 > Record and recover watermark > > > Key: SPARK-18513 > URL: https://issues.apache.org/jira/browse/SPARK-18513 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Liwei Lin >Assignee: Tyson Condie >Priority: Blocker > Fix For: 2.1.0 > > > We should record the watermark into the persistent log and recover it to > ensure determinism. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
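For context, the watermark being recorded here is the one a streaming query defines via {{withWatermark}}; a small Scala sketch, where {{events}} is an assumed streaming DataFrame with an {{eventTime}} column:

{code}
import org.apache.spark.sql.functions.window
import spark.implicits._

// Sketch only: a streaming aggregation whose event-time watermark would need to be
// written to the persistent log and restored on recovery for deterministic results.
val counts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window($"eventTime", "5 minutes"))
  .count()
{code}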
[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization
[ https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15704407#comment-15704407 ] Apache Spark commented on SPARK-17931: -- User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/16053 > taskScheduler has some unneeded serialization > - > > Key: SPARK-17931 > URL: https://issues.apache.org/jira/browse/SPARK-17931 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Guoqiang Li > > When taskScheduler instantiates TaskDescription, it calls > `Task.serializeWithDependencies(task, sched.sc.addedFiles, > sched.sc.addedJars, ser)`. It serializes the task and its dependencies. > But now that SPARK-2521 has been merged into master, the ResultTask class > and ShuffleMapTask class no longer contain RDD and closure objects. > The TaskDescription class can be changed as below: > {noformat} > class TaskDescription[T]( > val taskId: Long, > val attemptNumber: Int, > val executorId: String, > val name: String, > val index: Int, > val task: Task[T]) extends Serializable > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18506) kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic
[ https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15704372#comment-15704372 ] Cody Koeninger commented on SPARK-18506: 1 x spark master is m3 medium 2 x spark workers are m3 xlarge Looking back at that particular testing setup, kafka and ZK were on a single m3 large, which is admittedly unrealistic. I'm a little too busy at the moment to try again with a more realistic setup though. > kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a > single partition on a multi partition topic > --- > > Key: SPARK-18506 > URL: https://issues.apache.org/jira/browse/SPARK-18506 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 > Environment: Problem occurs both in Hadoop/YARN 2.7.3 and Spark > standalone mode 2.0.2 > with Kafka 0.10.1.0. >Reporter: Heji Kim > > Our team is trying to upgrade to Spark 2.0.2/Kafka > 0.10.1.0/spark-streaming-kafka-0-10_2.11 (v 2.0.2) and we cannot get our > drivers to read all partitions of a single stream when kafka > auto.offset.reset=earliest running on a real cluster(separate VM nodes). > When we run our drivers with auto.offset.reset=latest ingesting from a single > kafka topic with multiple partitions (usually 10 but problem shows up with > only 3 partitions), the driver reads correctly from all partitions. > Unfortunately, we need "earliest" for exactly once semantics. > In the same kafka 0.10.1.0/spark 2.x setup, our legacy driver using > spark-streaming-kafka-0-8_2.11 with the prior setting > auto.offset.reset=smallest runs correctly. > We have tried the following configurations in trying to isolate our problem > but it is only auto.offset.reset=earliest on a "real multi-machine cluster" > which causes this problem. > 1. Ran with spark standalone cluster(4 Debian nodes, 8vCPU/30GB each) > instead of YARN 2.7.3. Single partition read problem persists both cases. > Please note this problem occurs on an actual cluster of separate VM nodes > (but not when our engineer runs in as a cluster on his own Mac.) > 2. Ran with spark 2.1 nightly build for the last 10 days. Problem persists. > 3. Turned off checkpointing. Problem persists with or without checkpointing. > 4. Turned off backpressure. Problem persists with or without backpressure. > 5. Tried both partition.assignment.strategy RangeAssignor and > RoundRobinAssignor. Broken with both. > 6. Tried both LocationStrategies (PreferConsistent/PreferFixed). Broken with > both. > 7. Tried the simplest scala driver that only logs. (Our team uses java.) > Broken with both. > 8. Tried increasing GCE capacity for cluster but already we were highly > overprovisioned for cores and memory. Also tried ramping up executors and > cores. Since driver works with auto.offset.reset=latest, we have ruled out > GCP cloud infrastructure issues. > When we turn on the debug logs, we sometimes see partitions being set to > different offset configuration even though the consumer config correctly > indicates auto.offset.reset=earliest. > {noformat} > 8 DEBUG Resetting offset for partition simple_test-8 to earliest offset. > (org.apache.kafka.clients.consumer.internals.Fetcher) > 9 DEBUG Resetting offset for partition simple_test-9 to latest offset. 
> (org.apache.kafka.clients.consumer.internals.Fetcher) > 8 TRACE Sending ListOffsetRequest > {replica_id=-1,topics=[{topic=simple_test,partitions=[{partition=8,timestamp=-2}]}]} > to broker 10.102.20.12:9092 (id: 12 rack: null) > (org.apache.kafka.clients.consumer.internals.Fetcher) > 9 TRACE Sending ListOffsetRequest > {replica_id=-1,topics=[{topic=simple_test,partitions=[{partition=9,timestamp=-1}]}]} > to broker 10.102.20.13:9092 (id: 13 rack: null) > (org.apache.kafka.clients.consumer.internals.Fetcher) > 8 TRACE Received ListOffsetResponse > {responses=[{topic=simple_test,partition_responses=[{partition=8,error_code=0,timestamp=-1,offset=0}]}]} > from broker 10.102.20.12:9092 (id: 12 rack: null) > (org.apache.kafka.clients.consumer.internals.Fetcher) > 9 TRACE Received ListOffsetResponse > {responses=[{topic=simple_test,partition_responses=[{partition=9,error_code=0,timestamp=-1,offset=66724}]}]} > from broker 10.102.20.13:9092 (id: 13 rack: null) > (org.apache.kafka.clients.consumer.internals.Fetcher) > 8 DEBUG Fetched {timestamp=-1, offset=0} for partition simple_test-8 > (org.apache.kafka.clients.consumer.internals.Fetcher) > 9 DEBUG Fetched {timestamp=-1, offset=66724} for partition simple_test-9 > (org.apache.kafka.clients.consumer.internals.Fetcher) > {noformat} > I've enclosed below the completely stripped down trivial test driver that > shows this behavior. Aft
[jira] [Assigned] (SPARK-18617) Close "kryo auto pick" feature for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-18617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18617: Assignee: Apache Spark > Close "kryo auto pick" feature for Spark Streaming > -- > > Key: SPARK-18617 > URL: https://issues.apache.org/jira/browse/SPARK-18617 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.2 >Reporter: Genmao Yu >Assignee: Apache Spark > > [PR-15992| https://github.com/apache/spark/pull/15992] provided a solution to > fix the bug, i.e. {{receiver data can not be deserialized properly}}. As > [~zsxwing] said, it is a critical bug, but we should not break APIs between > maintenance releases. It may be a rational choice to close {{auto pick kryo > serializer}} for Spark Streaming in the first step. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18617) Close "kryo auto pick" feature for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-18617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18617: Assignee: (was: Apache Spark) > Close "kryo auto pick" feature for Spark Streaming > -- > > Key: SPARK-18617 > URL: https://issues.apache.org/jira/browse/SPARK-18617 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.2 >Reporter: Genmao Yu > > [PR-15992| https://github.com/apache/spark/pull/15992] provided a solution to > fix the bug, i.e. {{receiver data can not be deserialized properly}}. As > [~zsxwing] said, it is a critical bug, but we should not break APIs between > maintenance releases. It may be a rational choice to close {{auto pick kryo > serializer}} for Spark Streaming in the first step. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18617) Close "kryo auto pick" feature for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-18617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15704357#comment-15704357 ] Apache Spark commented on SPARK-18617: -- User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/16052 > Close "kryo auto pick" feature for Spark Streaming > -- > > Key: SPARK-18617 > URL: https://issues.apache.org/jira/browse/SPARK-18617 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.2 >Reporter: Genmao Yu > > [PR-15992| https://github.com/apache/spark/pull/15992] provided a solution to > fix the bug, i.e. {{receiver data can not be deserialized properly}}. As > [~zsxwing] said, it is a critical bug, but we should not break APIs between > maintenance releases. It may be a rational choice to close {{auto pick kryo > serializer}} for Spark Streaming in the first step. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17897) not isnotnull is converted to the always false condition isnotnull && not isnotnull
[ https://issues.apache.org/jira/browse/SPARK-17897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15704347#comment-15704347 ] Xiao Li commented on SPARK-17897: - Let me try to reproduce it in master > not isnotnull is converted to the always false condition isnotnull && not > isnotnull > --- > > Key: SPARK-17897 > URL: https://issues.apache.org/jira/browse/SPARK-17897 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.0.0, 2.0.1 >Reporter: Jordan Halterman > > When a logical plan is built containing the following somewhat nonsensical > filter: > {{Filter (NOT isnotnull($f0#212))}} > During optimization the filter is converted into a condition that will always > fail: > {{Filter (isnotnull($f0#212) && NOT isnotnull($f0#212))}} > This appears to be caused by the following check for {{NullIntolerant}}: > https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R63 > Which recurses through the expression and extracts nested {{IsNotNull}} > calls, converting them to {{IsNotNull}} calls on the attribute at the root > level: > https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R49 > This results in the nonsensical condition above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17680) Unicode Character Support for Column Names and Comments
[ https://issues.apache.org/jira/browse/SPARK-17680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17680: Assignee: Apache Spark (was: Xiao Li) > Unicode Character Support for Column Names and Comments > --- > > Key: SPARK-17680 > URL: https://issues.apache.org/jira/browse/SPARK-17680 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Apache Spark > Fix For: 2.1.0 > > > Spark SQL supports Unicode characters for column names when specified within > backticks(`). When the Hive support is enabled, the version of the Hive > metastore must be higher than 0.12, See the JIRA: > https://issues.apache.org/jira/browse/HIVE-6013 Hive metastore supports > Unicode characters for column names since 0.13. > In Spark SQL, table comments, and view comments always allow Unicode > characters without backticks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18617) Close "kryo auto pick" feature for Spark Streaming
Genmao Yu created SPARK-18617: - Summary: Close "kryo auto pick" feature for Spark Streaming Key: SPARK-18617 URL: https://issues.apache.org/jira/browse/SPARK-18617 Project: Spark Issue Type: Bug Affects Versions: 2.0.2 Reporter: Genmao Yu [PR-15992| https://github.com/apache/spark/pull/15992] provided a solution to fix the bug, i.e. {{receiver data can not be deserialized properly}}. As [~zsxwing] said, it is a critical bug, but we should not break APIs between maintenance releases. It may be a rational choice to close {{auto pick kryo serializer}} for Spark Streaming in the first step. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17680) Unicode Character Support for Column Names and Comments
[ https://issues.apache.org/jira/browse/SPARK-17680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17680: Assignee: Xiao Li (was: Apache Spark) > Unicode Character Support for Column Names and Comments > --- > > Key: SPARK-17680 > URL: https://issues.apache.org/jira/browse/SPARK-17680 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.1.0 > > > Spark SQL supports Unicode characters for column names when specified within > backticks(`). When the Hive support is enabled, the version of the Hive > metastore must be higher than 0.12, See the JIRA: > https://issues.apache.org/jira/browse/HIVE-6013 Hive metastore supports > Unicode characters for column names since 0.13. > In Spark SQL, table comments, and view comments always allow Unicode > characters without backticks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-18590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18590: Issue Type: New Feature (was: Bug) > R - Include package vignettes and help pages, build source package in Spark > distribution > > > Key: SPARK-18590 > URL: https://issues.apache.org/jira/browse/SPARK-18590 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > > We should include in Spark distribution the built source package for SparkR. > This will enable help and vignettes when the package is used. Also this > source package is what we would release to CRAN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-18590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18590: Priority: Major (was: Blocker) > R - Include package vignettes and help pages, build source package in Spark > distribution > > > Key: SPARK-18590 > URL: https://issues.apache.org/jira/browse/SPARK-18590 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > > We should include in Spark distribution the built source package for SparkR. > This will enable help and vignettes when the package is used. Also this > source package is what we would release to CRAN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-18590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18590: Target Version/s: (was: 2.1.0) > R - Include package vignettes and help pages, build source package in Spark > distribution > > > Key: SPARK-18590 > URL: https://issues.apache.org/jira/browse/SPARK-18590 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Blocker > > We should include in Spark distribution the built source package for SparkR. > This will enable help and vignettes when the package is used. Also this > source package is what we would release to CRAN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-18590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18590: Target Version/s: 2.1.0 > R - Include package vignettes and help pages, build source package in Spark > distribution > > > Key: SPARK-18590 > URL: https://issues.apache.org/jira/browse/SPARK-18590 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Blocker > > We should include in Spark distribution the built source package for SparkR. > This will enable help and vignettes when the package is used. Also this > source package is what we would release to CRAN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17949) Introduce a JVM object based aggregate operator
[ https://issues.apache.org/jira/browse/SPARK-17949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17949: Labels: releasenotes (was: ) > Introduce a JVM object based aggregate operator > --- > > Key: SPARK-17949 > URL: https://issues.apache.org/jira/browse/SPARK-17949 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Cheng Lian > Labels: releasenotes > Fix For: 2.2.0 > > Attachments: [Design Doc] Support for Arbitrary Aggregation States.pdf > > > The new Tungsten execution engine has very robust memory management and speed > for simple data types. It does, however, suffer from the following: > # For user-defined aggregates (Hive UDAFs, Dataset typed operators), it is > fairly expensive to fit into the Tungsten internal format. > # For aggregate functions that require complex intermediate data structures, > Unsafe (on raw bytes) is not a good programming abstraction due to the lack > of structs. > The idea here is to introduce a JVM object based hash aggregate operator that > can support the aforementioned use cases. This operator, however, should > limit its memory usage to avoid putting too much pressure on GC, e.g. falling > back to sort-based aggregate as soon as the number of objects exceeds a very low > threshold. > Internally at Databricks we prototyped a version of this for a customer POC > and have observed substantial speed-ups over existing Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18566) remove OverwriteOptions
[ https://issues.apache.org/jira/browse/SPARK-18566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-18566: Target Version/s: 2.2.0 (was: 2.1.0) > remove OverwriteOptions > --- > > Key: SPARK-18566 > URL: https://issues.apache.org/jira/browse/SPARK-18566 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14543) SQL/Hive insertInto has unexpected results
[ https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-14543: Target Version/s: 2.2.0 (was: 2.1.0) > SQL/Hive insertInto has unexpected results > -- > > Key: SPARK-14543 > URL: https://issues.apache.org/jira/browse/SPARK-14543 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Ryan Blue >Assignee: Ryan Blue > > *Updated description* > There should be an option to match input data to output columns by name. The > API allows operations on tables, which hide the column resolution problem. > It's easy to copy from one table to another without listing the columns, and > in the API it is common to work with columns by name rather than by position. > I think the API should add a way to match columns by name, which is closer to > what users expect. I propose adding something like this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} > *Original description* > The Hive write path adds a pre-insertion cast (projection) to reconcile > incoming data columns with the outgoing table schema. Columns are matched by > position and casts are inserted to reconcile the two column schemas. > When columns aren't correctly aligned, this causes unexpected results. I ran > into this by not using a correct {{partitionBy}} call (addressed by > SPARK-14459), which caused an error message that an int could not be cast to > an array. However, if the columns are vaguely compatible, for example string > and float, then no error or warning is produced and data is written to the > wrong columns using unexpected casts (string -> bigint -> float). > A real-world use case that will hit this is when a table definition changes > by adding a column in the middle of a table. Spark SQL statements that copied > from that table to a destination table will then map the columns differently > but insert casts that mask the problem. The last column's data will be > dropped without a reliable warning for the user. > This highlights a few problems: > * Too many or too few incoming data columns should cause an AnalysisException > to be thrown > * Only "safe" casts should be inserted automatically, like int -> long, using > UpCast > * Pre-insertion casts currently ignore extra columns by using zip > * The pre-insertion cast logic differs between Hive's MetastoreRelation and > LogicalRelation > Also, I think there should be an option to match input data to output columns > by name. The API allows operations on tables, which hide the column > resolution problem. It's easy to copy from one table to another without > listing the columns, and in the API it is common to work with columns by name > rather than by position. I think the API should add a way to match columns by > name, which is closer to what users expect. I propose adding something like > this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17897) not isnotnull is converted to the always false condition isnotnull && not isnotnull
[ https://issues.apache.org/jira/browse/SPARK-17897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15704335#comment-15704335 ] Reynold Xin commented on SPARK-17897: - cc [~cloud_fan], [~smilegator], [~hvanhovell] > not isnotnull is converted to the always false condition isnotnull && not > isnotnull > --- > > Key: SPARK-17897 > URL: https://issues.apache.org/jira/browse/SPARK-17897 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.0.0, 2.0.1 >Reporter: Jordan Halterman > > When a logical plan is built containing the following somewhat nonsensical > filter: > {{Filter (NOT isnotnull($f0#212))}} > During optimization the filter is converted into a condition that will always > fail: > {{Filter (isnotnull($f0#212) && NOT isnotnull($f0#212))}} > This appears to be caused by the following check for {{NullIntolerant}}: > https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R63 > Which recurses through the expression and extracts nested {{IsNotNull}} > calls, converting them to {{IsNotNull}} calls on the attribute at the root > level: > https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R49 > This results in the nonsensical condition above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18544) Append with df.saveAsTable writes data to wrong location
[ https://issues.apache.org/jira/browse/SPARK-18544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18544: Issue Type: Sub-task (was: Bug) Parent: SPARK-17861 > Append with df.saveAsTable writes data to wrong location > > > Key: SPARK-18544 > URL: https://issues.apache.org/jira/browse/SPARK-18544 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Priority: Blocker > Fix For: 2.1.0 > > > When using saveAsTable in append mode, data will be written to the wrong > location for non-managed Datasource tables. The following example illustrates > this. > It seems to somehow pass the wrong table path to InsertIntoHadoopFsRelation from > DataFrameWriter. Also, we should probably remove the repair table call at the > end of saveAsTable in DataFrameWriter. That shouldn't be needed in either the > Hive or Datasource case. > {code} > scala> spark.sqlContext.range(100).selectExpr("id", "id as A", "id as > B").write.partitionBy("A", "B").mode("overwrite").parquet("/tmp/test") > scala> sql("create table test (id long, A int, B int) USING parquet OPTIONS > (path '/tmp/test') PARTITIONED BY (A, B)") > scala> sql("msck repair table test") > scala> sql("select * from test where A = 1").count > res6: Long = 1 > scala> spark.sqlContext.range(10).selectExpr("id", "id as A", "id as > B").write.partitionBy("A", "B").mode("append").saveAsTable("test") > scala> sql("select * from test where A = 1").count > res8: Long = 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18544) Append with df.saveAsTable writes data to wrong location
[ https://issues.apache.org/jira/browse/SPARK-18544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18544: Assignee: Eric Liang > Append with df.saveAsTable writes data to wrong location > > > Key: SPARK-18544 > URL: https://issues.apache.org/jira/browse/SPARK-18544 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Blocker > Fix For: 2.1.0 > > > When using saveAsTable in append mode, data will be written to the wrong > location for non-managed Datasource tables. The following example illustrates > this. > It seems to somehow pass the wrong table path to InsertIntoHadoopFsRelation from > DataFrameWriter. Also, we should probably remove the repair table call at the > end of saveAsTable in DataFrameWriter. That shouldn't be needed in either the > Hive or Datasource case. > {code} > scala> spark.sqlContext.range(100).selectExpr("id", "id as A", "id as > B").write.partitionBy("A", "B").mode("overwrite").parquet("/tmp/test") > scala> sql("create table test (id long, A int, B int) USING parquet OPTIONS > (path '/tmp/test') PARTITIONED BY (A, B)") > scala> sql("msck repair table test") > scala> sql("select * from test where A = 1").count > res6: Long = 1 > scala> spark.sqlContext.range(10).selectExpr("id", "id as A", "id as > B").write.partitionBy("A", "B").mode("append").saveAsTable("test") > scala> sql("select * from test where A = 1").count > res8: Long = 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18544) Append with df.saveAsTable writes data to wrong location
[ https://issues.apache.org/jira/browse/SPARK-18544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18544. - Resolution: Fixed Fix Version/s: 2.1.0 > Append with df.saveAsTable writes data to wrong location > > > Key: SPARK-18544 > URL: https://issues.apache.org/jira/browse/SPARK-18544 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Eric Liang >Priority: Blocker > Fix For: 2.1.0 > > > When using saveAsTable in append mode, data will be written to the wrong > location for non-managed Datasource tables. The following example illustrates > this. > It seems to somehow pass the wrong table path to InsertIntoHadoopFsRelation from > DataFrameWriter. Also, we should probably remove the repair table call at the > end of saveAsTable in DataFrameWriter. That shouldn't be needed in either the > Hive or Datasource case. > {code} > scala> spark.sqlContext.range(100).selectExpr("id", "id as A", "id as > B").write.partitionBy("A", "B").mode("overwrite").parquet("/tmp/test") > scala> sql("create table test (id long, A int, B int) USING parquet OPTIONS > (path '/tmp/test') PARTITIONED BY (A, B)") > scala> sql("msck repair table test") > scala> sql("select * from test where A = 1").count > res6: Long = 1 > scala> spark.sqlContext.range(10).selectExpr("id", "id as A", "id as > B").write.partitionBy("A", "B").mode("append").saveAsTable("test") > scala> sql("select * from test where A = 1").count > res8: Long = 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18547) Decouple I/O encryption key propagation from UserGroupInformation
[ https://issues.apache.org/jira/browse/SPARK-18547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-18547. -- Resolution: Fixed Fix Version/s: 2.1.0 > Decouple I/O encryption key propagation from UserGroupInformation > - > > Key: SPARK-18547 > URL: https://issues.apache.org/jira/browse/SPARK-18547 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 2.1.0 > > > Currently, the encryption key used by the shuffle code is propagated using > {{UserGroupInformation}} and thus only works on YARN. That makes it really > painful to write unit tests in core that include encryption functionality. > We should change that so that writing these tests is possible, and also > because that would allow shuffle encryption to work with other cluster > managers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18588) KafkaSourceStressForDontFailOnDataLossSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-18588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18588: Assignee: Shixiong Zhu (was: Apache Spark) > KafkaSourceStressForDontFailOnDataLossSuite is flaky > > > Key: SPARK-18588 > URL: https://issues.apache.org/jira/browse/SPARK-18588 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Herman van Hovell >Assignee: Shixiong Zhu > > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite&test_name=stress+test+for+failOnDataLoss%3Dfalse -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18588) KafkaSourceStressForDontFailOnDataLossSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-18588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15704132#comment-15704132 ] Apache Spark commented on SPARK-18588: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16051 > KafkaSourceStressForDontFailOnDataLossSuite is flaky > > > Key: SPARK-18588 > URL: https://issues.apache.org/jira/browse/SPARK-18588 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Herman van Hovell >Assignee: Shixiong Zhu > > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite&test_name=stress+test+for+failOnDataLoss%3Dfalse -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18588) KafkaSourceStressForDontFailOnDataLossSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-18588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18588: Assignee: Apache Spark (was: Shixiong Zhu) > KafkaSourceStressForDontFailOnDataLossSuite is flaky > > > Key: SPARK-18588 > URL: https://issues.apache.org/jira/browse/SPARK-18588 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Herman van Hovell >Assignee: Apache Spark > > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite&test_name=stress+test+for+failOnDataLoss%3Dfalse -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12307) ParquetFormat options should be exposed through the DataFrameReader/Writer options API
[ https://issues.apache.org/jira/browse/SPARK-12307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15704121#comment-15704121 ] Hyukjin Kwon commented on SPARK-12307: -- Hi [~holdenk], I guess we are able to set this via {{option(..)}}. Could we resolve this JIRA as a duplicate (or subset) of SPARK-14913? > ParquetFormat options should be exposed through the DataFrameReader/Writer > options API > -- > > Key: SPARK-12307 > URL: https://issues.apache.org/jira/browse/SPARK-12307 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: holdenk >Priority: Trivial > > Currently many options for loading/saving Parquet need to be set globally on > the SparkContext. It would be useful to also provide support for setting > these options through the DataFrameReader/DataFrameWriter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
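A small Scala sketch of the kind of per-read configuration being requested above, using {{mergeSchema}} as an example of a Parquet option passed on a single read rather than set globally (treat the option name as illustrative) and assuming a SparkSession named {{spark}}:

{code}
// Sketch only: pass a Parquet-specific option on one read instead of configuring it
// globally on the SparkContext/SQLContext.
val df = spark.read
  .option("mergeSchema", "true")   // illustrative Parquet option
  .parquet("/data/events")
{code}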
[jira] [Commented] (SPARK-2363) Clean MLlib's sample data files
[ https://issues.apache.org/jira/browse/SPARK-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15704117#comment-15704117 ] Apache Spark commented on SPARK-2363: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/1394 > Clean MLlib's sample data files > --- > > Key: SPARK-2363 > URL: https://issues.apache.org/jira/browse/SPARK-2363 > Project: Spark > Issue Type: Task > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Sean Owen >Priority: Minor > Fix For: 1.1.0 > > > MLlib has sample data under serveral folders: > 1) data/mllib > 2) data/ > 3) mllib/data/* > Per previous discussion with [~matei], we want to put them under `data/mllib` > and clean outdated files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18523) OOM killer may leave SparkContext in broken state causing Connection Refused errors
[ https://issues.apache.org/jira/browse/SPARK-18523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18523. - Resolution: Fixed Assignee: Alexander Shorin Fix Version/s: 2.1.0 > OOM killer may leave SparkContext in broken state causing Connection Refused > errors > --- > > Key: SPARK-18523 > URL: https://issues.apache.org/jira/browse/SPARK-18523 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Alexander Shorin >Assignee: Alexander Shorin > Fix For: 2.1.0 > > > When you run a memory-heavy Spark job, the Spark driver may consume more > memory than the host is able to provide. > In this case the OOM killer steps in and kills the spark-submit > process. > The pyspark.SparkContext is not able to handle this state of things and > becomes completely broken. > You cannot stop it: on stop it tries to call the stop method of the bound Java > context (jsc) and fails with a Py4JError, because that process no longer exists, > and neither does the connection to it. > You cannot start a new SparkContext because the broken one is still registered as the > active one, and pyspark still treats SparkContext as a sort of > singleton. > The only thing you can do is shut down your IPython Notebook and start it > over, or dive into the SparkContext internal attributes and reset them manually > to their initial None state. > The OOM killer case is just one of many: any crash of spark-submit > in the middle of something leaves SparkContext in a broken state. > Example error log for {{sc.stop()}} in the broken state: > {code} > ERROR:root:Exception while sending command. > Traceback (most recent call last): > File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line > 883, in send_command > response = connection.send_command(command) > File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line > 1040, in send_command > "Error while receiving", e, proto.ERROR_ON_RECEIVE) > Py4JNetworkError: Error while receiving > ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java > server (127.0.0.1:59911) > Traceback (most recent call last): > File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line > 963, in start > self.socket.connect((self.address, self.port)) > File "/usr/local/lib/python2.7/socket.py", line 224, in meth > return getattr(self._sock,name)(*args) > error: [Errno 61] Connection refused > --- > Py4JError Traceback (most recent call last) > in () > > 1 sc.stop() > /usr/local/share/spark/python/pyspark/context.py in stop(self) > 360 """ > 361 if getattr(self, "_jsc", None): > --> 362 self._jsc.stop() > 363 self._jsc = None > 364 if getattr(self, "_accumulatorServer", None): > /usr/local/lib/python2.7/site-packages/py4j/java_gateway.pyc in > __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /usr/local/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 43 def deco(*a, **kw): > 44 try: > ---> 45 return f(*a, **kw) > 46 except py4j.protocol.Py4JJavaError as e: > 47 s = e.java_exception.toString() > /usr/local/lib/python2.7/site-packages/py4j/protocol.pyc in > get_return_value(answer, gateway_client, target_id, name) > 325 raise Py4JError( > 326 "An error occurred while calling {0}{1}{2}".
> --> 327 format(target_id, ".", name)) > 328 else: > 329 type = answer[1] > Py4JError: An error occurred while calling o47.stop > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
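For context, here is a minimal Python sketch of the manual workaround the report describes (resetting the Python-side state once the backing JVM is gone). This is not the fix that shipped in 2.1.0, and the private attributes {{_jsc}} and {{_active_spark_context}} are assumptions drawn from the traceback above.
{code}
# Hedged sketch of the manual workaround, not the shipped fix: if the
# spark-submit JVM has been OOM-killed, sc.stop() raises a py4j error, so we
# drop the Python-side references (private attributes, per the traceback
# above) to allow a fresh SparkContext to be created.
from py4j.protocol import Py4JError  # Py4JNetworkError is a subclass
from pyspark import SparkContext

def force_stop(sc):
    try:
        sc.stop()  # normal path: asks the JVM-side context to stop
    except Py4JError:
        # The JVM process (and its socket) is gone; clear the references
        # so that a new SparkContext can be constructed.
        sc._jsc = None
        with SparkContext._lock:
            SparkContext._active_spark_context = None
{code}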
[jira] [Assigned] (SPARK-17680) Unicode Character Support for Column Names and Comments
[ https://issues.apache.org/jira/browse/SPARK-17680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-17680: --- Assignee: Xiao Li (was: Kazuaki Ishizaki) > Unicode Character Support for Column Names and Comments > --- > > Key: SPARK-17680 > URL: https://issues.apache.org/jira/browse/SPARK-17680 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.1.0 > > > Spark SQL supports Unicode characters for column names when specified within > backticks(`). When the Hive support is enabled, the version of the Hive > metastore must be higher than 0.12, See the JIRA: > https://issues.apache.org/jira/browse/HIVE-6013 Hive metastore supports > Unicode characters for column names since 0.13. > In Spark SQL, table comments, and view comments always allow Unicode > characters without backticks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
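As an illustration of the behaviour the ticket describes (a hypothetical snippet, not the test case that was added; the table and column names are made up), Unicode column names are referenced with backticks:
{code}
# Illustrative only: backticks quote a Unicode column name in Spark SQL.
# Table comments, per the ticket, accept Unicode without any backticks.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("CREATE TABLE unicode_demo (`列1` INT) USING parquet")
spark.sql("SELECT `列1` FROM unicode_demo").show()
{code}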
[jira] [Resolved] (SPARK-17905) Added test cases for InMemoryRelation
[ https://issues.apache.org/jira/browse/SPARK-17905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-17905. - Resolution: Fixed Assignee: Kazuaki Ishizaki Fix Version/s: 2.1.0 > Added test cases for InMemoryRelation > - > > Key: SPARK-17905 > URL: https://issues.apache.org/jira/browse/SPARK-17905 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki > Fix For: 2.1.0 > > > Added test cases for InMemoryRelation for the following cases > - keep all data types with null or without null > - access only some columns in {{CachedBatch}} > - access {{CachedBatch}} disabling whole stage codegen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-17680) Unicode Character Support for Column Names and Comments
[ https://issues.apache.org/jira/browse/SPARK-17680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reopened SPARK-17680: - > Unicode Character Support for Column Names and Comments > --- > > Key: SPARK-17680 > URL: https://issues.apache.org/jira/browse/SPARK-17680 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.1.0 > > > Spark SQL supports Unicode characters for column names when specified within > backticks(`). When the Hive support is enabled, the version of the Hive > metastore must be higher than 0.12, See the JIRA: > https://issues.apache.org/jira/browse/HIVE-6013 Hive metastore supports > Unicode characters for column names since 0.13. > In Spark SQL, table comments, and view comments always allow Unicode > characters without backticks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18220) ClassCastException occurs when using select query on ORC file
[ https://issues.apache.org/jira/browse/SPARK-18220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703909#comment-15703909 ] Jerryjung commented on SPARK-18220: --- Sure, It was working in Spark 1.X. Even the 2.0.3 version works fine. As mentioned above, errors only occur in version 2.1.0. > ClassCastException occurs when using select query on ORC file > - > > Key: SPARK-18220 > URL: https://issues.apache.org/jira/browse/SPARK-18220 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jerryjung > Labels: orcfile, sql > > Error message is below. > {noformat} > == > 16/11/02 16:38:09 INFO ReaderImpl: Reading ORC rows from > hdfs://xxx/part-00022 with {include: [true], offset: 0, length: > 9223372036854775807} > 16/11/02 16:38:09 INFO Executor: Finished task 17.0 in stage 22.0 (TID 42). > 1220 bytes result sent to driver > 16/11/02 16:38:09 INFO TaskSetManager: Finished task 17.0 in stage 22.0 (TID > 42) in 116 ms on localhost (executor driver) (19/20) > 16/11/02 16:38:09 ERROR Executor: Exception in task 10.0 in stage 22.0 (TID > 35) > java.lang.ClassCastException: > org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to > org.apache.hadoop.io.Text > at > org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41) > at > org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:526) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > ORC dump info. > == > File Version: 0.12 with HIVE_8732 > 16/11/02 16:39:21 INFO orc.ReaderImpl: Reading ORC rows from > hdfs://XXX/part-0 with {include: null, offset: 0, length: > 9223372036854775807} > 16/11/02 16:39:21 INFO orc.RecordReaderFactory: Schema is not specified on > read. Using file schema. 
> Rows: 7 > Compression: ZLIB > Compression size: 262144 > Type: > struct > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18220) ClassCastException occurs when using select query on ORC file
[ https://issues.apache.org/jira/browse/SPARK-18220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703897#comment-15703897 ] Jerryjung commented on SPARK-18220: --- Same error occurred! spark-sql> CREATE TABLE zz as select * from d_c.dcoc_ircs_op_brch; 16/11/29 11:09:28 INFO SparkSqlParser: Parsing command: CREATE TABLE zz as select * from d_c.dcoc_ircs_op_brch 16/11/29 11:09:28 INFO HiveMetaStore: 0: get_database: d_c 16/11/29 11:09:28 INFO audit: ugi=hadoopip=unknown-ip-addr cmd=get_database: d_c 16/11/29 11:09:28 INFO HiveMetaStore: 0: get_table : db=d_c tbl=dcoc_ircs_op_brch 16/11/29 11:09:28 INFO audit: ugi=hadoopip=unknown-ip-addr cmd=get_table : db=d_c tbl=dcoc_ircs_op_brch 16/11/29 11:09:28 INFO HiveMetaStore: 0: get_table : db=d_c tbl=dcoc_ircs_op_brch 16/11/29 11:09:28 INFO audit: ugi=hadoopip=unknown-ip-addr cmd=get_table : db=d_c tbl=dcoc_ircs_op_brch 16/11/29 11:09:28 INFO CatalystSqlParser: Parsing command: varchar(6) 16/11/29 11:09:28 INFO CatalystSqlParser: Parsing command: varchar(50) 16/11/29 11:09:28 INFO CatalystSqlParser: Parsing command: varchar(4) 16/11/29 11:09:28 INFO CatalystSqlParser: Parsing command: varchar(50) 16/11/29 11:09:28 INFO CatalystSqlParser: Parsing command: timestamp 16/11/29 11:09:30 INFO HiveMetaStore: 0: get_table : db=default tbl=zz 16/11/29 11:09:30 INFO audit: ugi=hadoopip=unknown-ip-addr cmd=get_table : db=default tbl=zz 16/11/29 11:09:30 INFO HiveMetaStore: 0: get_database: default 16/11/29 11:09:30 INFO audit: ugi=hadoopip=unknown-ip-addr cmd=get_database: default 16/11/29 11:09:30 INFO HiveMetaStore: 0: get_database: default 16/11/29 11:09:30 INFO audit: ugi=hadoopip=unknown-ip-addr cmd=get_database: default 16/11/29 11:09:30 INFO HiveMetaStore: 0: get_table : db=default tbl=zz 16/11/29 11:09:30 INFO audit: ugi=hadoopip=unknown-ip-addr cmd=get_table : db=default tbl=zz 16/11/29 11:09:30 INFO HiveMetaStore: 0: get_database: default 16/11/29 11:09:30 INFO audit: ugi=hadoopip=unknown-ip-addr cmd=get_database: default 16/11/29 11:09:30 INFO HiveMetaStore: 0: create_table: Table(tableName:zz, dbName:default, owner:hadoop, createTime:1480385368, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:ircs_op_brch_cd, type:string, comment:null), FieldSchema(name:ircs_op_brch_nm, type:string, comment:null), FieldSchema(name:cms_brch_cd, type:string, comment:null), FieldSchema(name:cms_brch_nm, type:string, comment:null), FieldSchema(name:etl_job_dtm, type:timestamp, comment:null)], location:hdfs://xxx/user/hive/warehouse/zz, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{spark.sql.sources.schema.part.0={"type":"struct","fields":[{"name":"ircs_op_brch_cd","type":"string","nullable":true,"metadata":{}},{"name":"ircs_op_brch_nm","type":"string","nullable":true,"metadata":{}},{"name":"cms_brch_cd","type":"string","nullable":true,"metadata":{}},{"name":"cms_brch_nm","type":"string","nullable":true,"metadata":{}},{"name":"etl_job_dtm","type":"timestamp","nullable":true,"metadata":{}}]}, spark.sql.sources.schema.numParts=1, spark.sql.sources.provider=hive}, viewOriginalText:null, 
viewExpandedText:null, tableType:MANAGED_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, rolePrivileges:null)) ... parameters:{spark.sql.sources.schema.part.0={"type":"struct","fields":[{"name":"ircs_op_brch_cd","type":"string","nullable":true,"metadata":{}},{"name":"ircs_op_brch_nm","type":"string","nullable":true,"metadata":{}},{"name":"cms_brch_cd","type":"string","nullable":true,"metadata":{}},{"name":"cms_brch_nm","type":"string","nullable":true,"metadata":{}},{"name":"etl_job_dtm","type":"timestamp","nullable":true,"metadata":{}}]}, spark.sql.sources.schema.numParts=1, spark.sql.sources.provider=hive}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, rolePrivileges:null)) ... 16/11/29 11:09:30 INFO FileUtils: Creating directory if it doesn't exist: hdfs://xxx/user/hive/warehouse/zz 16/11/29 11:09:31 INFO HiveMetaStore: 0: get_table : db=default tbl=zz 16/11/29 11:09:31 INFO audit: ugi=hadoopip=unknown-ip-addr cmd=get_table : db=default tbl=zz 16/11/29 11:09:31 INFO CatalystSqlParser: Parsing command: string 16/11/29 11:09:31 INFO CatalystSqlParser: Parsing command: string 16/11/29 11:09:31 INFO Ca
[jira] [Created] (SPARK-18616) Pure Python Implementation of MLWritable for use in Pipeline
Andrea Matsunaga created SPARK-18616: Summary: Pure Python Implementation of MLWritable for use in Pipeline Key: SPARK-18616 URL: https://issues.apache.org/jira/browse/SPARK-18616 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.0.2 Environment: pyspark Reporter: Andrea Matsunaga When developing an estimator and model completely in Python, it is possible to implement the save() function, and it works for a standalone model, but not when the model is added to a Pipeline. The reason is that the Pipeline save implementation forces the use of JavaMLWritable, thus also requiring the object to have methods that are meaningful only to Java objects. The Pipeline implementation needs a check for the type of writable object defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18615) Switch to multi-line doc to avoid a genjavadoc bug for backticks
[ https://issues.apache.org/jira/browse/SPARK-18615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18615: Assignee: Apache Spark > Switch to multi-line doc to avoid a genjavadoc bug for backticks > > > Key: SPARK-18615 > URL: https://issues.apache.org/jira/browse/SPARK-18615 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > I suspect this is related with SPARK-16153 and genjavadoc issue in > https://github.com/typesafehub/genjavadoc/issues/85 but I am not too sure. > Currently, single line comment does not mark down backticks to > {{..}} but prints as they are. For example, the line below: > {code} > /** Return an RDD with the pairs from `this` whose keys are not in `other`. */ > {code} > So, we could work around this as below: > {code} > /** > * Return an RDD with the pairs from `this` whose keys are not in `other`. > */ > {code} > Please refer the image in the pull request. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18615) Switch to multi-line doc to avoid a genjavadoc bug for backticks
[ https://issues.apache.org/jira/browse/SPARK-18615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18615: Assignee: (was: Apache Spark) > Switch to multi-line doc to avoid a genjavadoc bug for backticks > > > Key: SPARK-18615 > URL: https://issues.apache.org/jira/browse/SPARK-18615 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Hyukjin Kwon >Priority: Minor > > I suspect this is related with SPARK-16153 and genjavadoc issue in > https://github.com/typesafehub/genjavadoc/issues/85 but I am not too sure. > Currently, single line comment does not mark down backticks to > {{..}} but prints as they are. For example, the line below: > {code} > /** Return an RDD with the pairs from `this` whose keys are not in `other`. */ > {code} > So, we could work around this as below: > {code} > /** > * Return an RDD with the pairs from `this` whose keys are not in `other`. > */ > {code} > Please refer the image in the pull request. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18615) Switch to multi-line doc to avoid a genjavadoc bug for backticks
[ https://issues.apache.org/jira/browse/SPARK-18615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703880#comment-15703880 ] Apache Spark commented on SPARK-18615: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/16050 > Switch to multi-line doc to avoid a genjavadoc bug for backticks > > > Key: SPARK-18615 > URL: https://issues.apache.org/jira/browse/SPARK-18615 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Hyukjin Kwon >Priority: Minor > > I suspect this is related with SPARK-16153 and genjavadoc issue in > https://github.com/typesafehub/genjavadoc/issues/85 but I am not too sure. > Currently, single line comment does not mark down backticks to > {{..}} but prints as they are. For example, the line below: > {code} > /** Return an RDD with the pairs from `this` whose keys are not in `other`. */ > {code} > So, we could work around this as below: > {code} > /** > * Return an RDD with the pairs from `this` whose keys are not in `other`. > */ > {code} > Please refer the image in the pull request. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18615) Switch to multi-line doc to avoid a genjavadoc bug for backticks
Hyukjin Kwon created SPARK-18615: Summary: Switch to multi-line doc to avoid a genjavadoc bug for backticks Key: SPARK-18615 URL: https://issues.apache.org/jira/browse/SPARK-18615 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Hyukjin Kwon Priority: Minor I suspect this is related with SPARK-16153 and genjavadoc issue in https://github.com/typesafehub/genjavadoc/issues/85 but I am not too sure. Currently, single line comment does not mark down backticks to {{..}} but prints as they are. For example, the line below: {code} /** Return an RDD with the pairs from `this` whose keys are not in `other`. */ {code} So, we could work around this as below: {code} /** * Return an RDD with the pairs from `this` whose keys are not in `other`. */ {code} Please refer the image in the pull request. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18220) ClassCastException occurs when using select query on ORC file
[ https://issues.apache.org/jira/browse/SPARK-18220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703853#comment-15703853 ] Wenchen Fan commented on SPARK-18220: - BTW, the table created by Spark 1.X, are you able to read it with Spark 1.X? > ClassCastException occurs when using select query on ORC file > - > > Key: SPARK-18220 > URL: https://issues.apache.org/jira/browse/SPARK-18220 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jerryjung > Labels: orcfile, sql > > Error message is below. > {noformat} > == > 16/11/02 16:38:09 INFO ReaderImpl: Reading ORC rows from > hdfs://xxx/part-00022 with {include: [true], offset: 0, length: > 9223372036854775807} > 16/11/02 16:38:09 INFO Executor: Finished task 17.0 in stage 22.0 (TID 42). > 1220 bytes result sent to driver > 16/11/02 16:38:09 INFO TaskSetManager: Finished task 17.0 in stage 22.0 (TID > 42) in 116 ms on localhost (executor driver) (19/20) > 16/11/02 16:38:09 ERROR Executor: Exception in task 10.0 in stage 22.0 (TID > 35) > java.lang.ClassCastException: > org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to > org.apache.hadoop.io.Text > at > org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41) > at > org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:526) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > ORC dump info. > == > File Version: 0.12 with HIVE_8732 > 16/11/02 16:39:21 INFO orc.ReaderImpl: Reading ORC rows from > hdfs://XXX/part-0 with {include: null, offset: 0, length: > 9223372036854775807} > 16/11/02 16:39:21 INFO orc.RecordReaderFactory: Schema is not specified on > read. Using file schema. 
> Rows: 7 > Compression: ZLIB > Compression size: 262144 > Type: > struct > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16554) Spark should kill executors when they are blacklisted
[ https://issues.apache.org/jira/browse/SPARK-16554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703851#comment-15703851 ] Mridul Muralidharan commented on SPARK-16554: - It would also be good if, as a best-effort measure, we could move currently hosted blocks off the node before it is killed. This would reduce recomputation (if there is only a single copy of an RDD block) and/or prevent under-replication (if block replication > 1) > Spark should kill executors when they are blacklisted > - > > Key: SPARK-16554 > URL: https://issues.apache.org/jira/browse/SPARK-16554 > Project: Spark > Issue Type: New Feature > Components: Scheduler >Reporter: Imran Rashid > > SPARK-8425 will allow blacklisting faulty executors and nodes. However, > these blacklisted executors will continue to run. This is bad for a few > reasons: > (1) Even if there is faulty-hardware, if the cluster is under-utilized spark > may be able to request another executor on a different node. > (2) If there is a faulty-disk (the most common case of faulty-hardware), the > cluster manager may be able to allocate another executor on the same node, if > it can exclude the bad disk. (Yarn will do this with its disk-health > checker.) > With dynamic allocation, this may seem less critical, as a blacklisted > executor will stop running new tasks and eventually get reclaimed. However, > if there is cached data on those executors, they will not get killed till > {{spark.dynamicAllocation.cachedExecutorIdleTimeout}} expires, which is > (effectively) infinite by default. > Users may not *always* want to kill bad executors, so this must be > configurable to some extent. At a minimum, it should be possible to enable / > disable it; perhaps the executor should be killed after it has been > blacklisted a configurable {{N}} times. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18220) ClassCastException occurs when using select query on ORC file
[ https://issues.apache.org/jira/browse/SPARK-18220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703848#comment-15703848 ] Wenchen Fan commented on SPARK-18220: - The orc file is written by Hive, not by Spark SQL, can you use `CREATE TABLE ... AS SELECT ... FROM hive_table` to make Spark SQL to write out the orc file and try again? > ClassCastException occurs when using select query on ORC file > - > > Key: SPARK-18220 > URL: https://issues.apache.org/jira/browse/SPARK-18220 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jerryjung > Labels: orcfile, sql > > Error message is below. > {noformat} > == > 16/11/02 16:38:09 INFO ReaderImpl: Reading ORC rows from > hdfs://xxx/part-00022 with {include: [true], offset: 0, length: > 9223372036854775807} > 16/11/02 16:38:09 INFO Executor: Finished task 17.0 in stage 22.0 (TID 42). > 1220 bytes result sent to driver > 16/11/02 16:38:09 INFO TaskSetManager: Finished task 17.0 in stage 22.0 (TID > 42) in 116 ms on localhost (executor driver) (19/20) > 16/11/02 16:38:09 ERROR Executor: Exception in task 10.0 in stage 22.0 (TID > 35) > java.lang.ClassCastException: > org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to > org.apache.hadoop.io.Text > at > org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41) > at > org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:526) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > ORC dump info. > == > File Version: 0.12 with HIVE_8732 > 16/11/02 16:39:21 INFO orc.ReaderImpl: Reading ORC rows from > hdfs://XXX/part-0 with {include: null, offset: 0, length: > 9223372036854775807} > 16/11/02 16:39:21 INFO orc.RecordReaderFactory: Schema is not specified on > read. Using file schema. 
> Rows: 7 > Compression: ZLIB > Compression size: 262144 > Type: > struct > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
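A rough reproduction sketch of the suggestion above, with placeholder table names: let Spark SQL itself write the ORC data via CTAS and then query the copy.
{code}
# Hypothetical repro of the suggested CTAS check; 'hive_orc_table' and
# 'orc_copy' are placeholder names, not from the ticket.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("""
  CREATE TABLE orc_copy
  USING orc
  AS SELECT * FROM hive_orc_table
""")
spark.sql("SELECT * FROM orc_copy LIMIT 10").show()
{code}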
[jira] [Commented] (SPARK-16282) Implement percentile SQL function
[ https://issues.apache.org/jira/browse/SPARK-16282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703842#comment-15703842 ] Apache Spark commented on SPARK-16282: -- User 'lins05' has created a pull request for this issue: https://github.com/apache/spark/pull/16049 > Implement percentile SQL function > - > > Key: SPARK-16282 > URL: https://issues.apache.org/jira/browse/SPARK-16282 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Jiang Xingbo > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
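Assuming the function follows the usual Hive-compatible form (an assumption, since the ticket body gives no signature), usage would look roughly like this:
{code}
# Rough usage sketch of an exact percentile aggregate in Spark SQL; the name
# and signature follow the Hive-compatible percentile(col, p) form and are
# assumed rather than taken from the ticket.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(1, 101).createOrReplaceTempView("nums")
spark.sql("SELECT percentile(id, 0.5) AS median FROM nums").show()
{code}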
[jira] [Commented] (SPARK-17896) Dataset groupByKey + reduceGroups fails with codegen-related exception
[ https://issues.apache.org/jira/browse/SPARK-17896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703782#comment-15703782 ] Takeshi Yamamuro commented on SPARK-17896: -- yea, this works well even on master. > Dataset groupByKey + reduceGroups fails with codegen-related exception > -- > > Key: SPARK-17896 > URL: https://issues.apache.org/jira/browse/SPARK-17896 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Databricks, MacOS >Reporter: Adam Breindel > > possible regression: works on 2.0, fails on 2.0.1 > following code raises exception related to wholestage codegen: > case class Zip(city:String, zip:String, state:String) > val z1 = Zip("New York", "1", "NY") > val z2 = Zip("New York", "10001", "NY") > val z3 = Zip("Chicago", "60606", "IL") > val zips = sc.parallelize(Seq(z1, z2, z3)).toDS > zips.groupByKey(_.state).reduceGroups((z1, z2) => Zip("*", z1.zip + " " + > z2.zip, z1.state)).show -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18588) KafkaSourceStressForDontFailOnDataLossSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-18588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-18588: Assignee: Shixiong Zhu > KafkaSourceStressForDontFailOnDataLossSuite is flaky > > > Key: SPARK-18588 > URL: https://issues.apache.org/jira/browse/SPARK-18588 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Herman van Hovell >Assignee: Shixiong Zhu > > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite&test_name=stress+test+for+failOnDataLoss%3Dfalse -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18541) Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management in pyspark SQL API
[ https://issues.apache.org/jira/browse/SPARK-18541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703755#comment-15703755 ] Shea Parkes commented on SPARK-18541: - Yea, I originally did {{aliasWithMetadata}} because I could monkey-patch without conflicts, but upon reflection, just changing the existing {{alias}} method to accept a {{metadata}} keyword argument should work fine. I'll see about getting a pull request up soon. > Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management > in pyspark SQL API > > > Key: SPARK-18541 > URL: https://issues.apache.org/jira/browse/SPARK-18541 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.0.2 > Environment: all >Reporter: Shea Parkes >Priority: Minor > Labels: newbie > Original Estimate: 24h > Remaining Estimate: 24h > > In the Scala SQL API, you can pass in new metadata when you alias a field. > That functionality is not available in the Python API. Right now, you have > to painfully utilize {{SparkSession.createDataFrame}} to manipulate the > metadata for even a single column. I would propose to add the following > method to {{pyspark.sql.Column}}: > {code} > def aliasWithMetadata(self, name, metadata): > """ > Make a new Column that has the provided alias and metadata. > Metadata will be processed with json.dumps() > """ > _context = pyspark.SparkContext._active_spark_context > _metadata_str = json.dumps(metadata) > _metadata_jvm = > _context._jvm.org.apache.spark.sql.types.Metadata.fromJson(_metadata_str) > _new_java_column = getattr(self._jc, 'as')(name, _metadata_jvm) > return Column(_new_java_column) > {code} > I can likely complete this request myself if there is any interest for it. > Just have to dust off my knowledge of doctest and the location of the python > tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
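A rough usage sketch of the direction discussed in the comment above (extending the existing {{alias}} with a {{metadata}} keyword); the exact signature shown is an assumption, not something settled in this ticket:
{code}
# Hypothetical usage of alias() accepting a metadata keyword argument, as
# discussed above; the keyword form is assumed for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0.97,)], ["score"])

df2 = df.select(df["score"].alias("score", metadata={"max": 1.0}))
print(df2.schema.fields[0].metadata)  # expected: {'max': 1.0}
{code}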
[jira] [Assigned] (SPARK-17905) Added test cases for InMemoryRelation
[ https://issues.apache.org/jira/browse/SPARK-17905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17905: Assignee: Apache Spark > Added test cases for InMemoryRelation > - > > Key: SPARK-17905 > URL: https://issues.apache.org/jira/browse/SPARK-17905 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark > > Added test cases for InMemoryRelation for the following cases > - keep all data types with null or without null > - access only some columns in {{CachedBatch}} > - access {{CachedBatch}} disabling whole stage codegen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17905) Added test cases for InMemoryRelation
[ https://issues.apache.org/jira/browse/SPARK-17905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703708#comment-15703708 ] Apache Spark commented on SPARK-17905: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/15462 > Added test cases for InMemoryRelation > - > > Key: SPARK-17905 > URL: https://issues.apache.org/jira/browse/SPARK-17905 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Kazuaki Ishizaki > > Added test cases for InMemoryRelation for the following cases > - keep all data types with null or without null > - access only some columns in {{CachedBatch}} > - access {{CachedBatch}} disabling whole stage codegen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17905) Added test cases for InMemoryRelation
[ https://issues.apache.org/jira/browse/SPARK-17905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17905: Assignee: (was: Apache Spark) > Added test cases for InMemoryRelation > - > > Key: SPARK-17905 > URL: https://issues.apache.org/jira/browse/SPARK-17905 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Kazuaki Ishizaki > > Added test cases for InMemoryRelation for the following cases > - keep all data types with null or without null > - access only some columns in {{CachedBatch}} > - access {{CachedBatch}} disabling whole stage codegen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18460) Include triggerDetails in StreamingQueryStatus.json
[ https://issues.apache.org/jira/browse/SPARK-18460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-18460. --- Resolution: Fixed It's in 2.1.0 RC1. > Include triggerDetails in StreamingQueryStatus.json > --- > > Key: SPARK-18460 > URL: https://issues.apache.org/jira/browse/SPARK-18460 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.0.2 >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17680) Unicode Character Support for Column Names and Comments
[ https://issues.apache.org/jira/browse/SPARK-17680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703583#comment-15703583 ] Dongjoon Hyun commented on SPARK-17680: --- It's marked as resolved. > Unicode Character Support for Column Names and Comments > --- > > Key: SPARK-17680 > URL: https://issues.apache.org/jira/browse/SPARK-17680 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Kazuaki Ishizaki > Fix For: 2.1.0 > > > Spark SQL supports Unicode characters for column names when specified within > backticks(`). When the Hive support is enabled, the version of the Hive > metastore must be higher than 0.12, See the JIRA: > https://issues.apache.org/jira/browse/HIVE-6013 Hive metastore supports > Unicode characters for column names since 0.13. > In Spark SQL, table comments, and view comments always allow Unicode > characters without backticks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-17680) Unicode Character Support for Column Names and Comments
[ https://issues.apache.org/jira/browse/SPARK-17680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-17680: --- > Unicode Character Support for Column Names and Comments > --- > > Key: SPARK-17680 > URL: https://issues.apache.org/jira/browse/SPARK-17680 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Kazuaki Ishizaki > Fix For: 2.1.0 > > > Spark SQL supports Unicode characters for column names when specified within > backticks(`). When the Hive support is enabled, the version of the Hive > metastore must be higher than 0.12, See the JIRA: > https://issues.apache.org/jira/browse/HIVE-6013 Hive metastore supports > Unicode characters for column names since 0.13. > In Spark SQL, table comments, and view comments always allow Unicode > characters without backticks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17680) Unicode Character Support for Column Names and Comments
[ https://issues.apache.org/jira/browse/SPARK-17680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-17680. --- Resolution: Fixed > Unicode Character Support for Column Names and Comments > --- > > Key: SPARK-17680 > URL: https://issues.apache.org/jira/browse/SPARK-17680 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Kazuaki Ishizaki > Fix For: 2.1.0 > > > Spark SQL supports Unicode characters for column names when specified within > backticks(`). When the Hive support is enabled, the version of the Hive > metastore must be higher than 0.12, See the JIRA: > https://issues.apache.org/jira/browse/HIVE-6013 Hive metastore supports > Unicode characters for column names since 0.13. > In Spark SQL, table comments, and view comments always allow Unicode > characters without backticks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17680) Unicode Character Support for Column Names and Comments
[ https://issues.apache.org/jira/browse/SPARK-17680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-17680. --- Resolution: Fixed > Unicode Character Support for Column Names and Comments > --- > > Key: SPARK-17680 > URL: https://issues.apache.org/jira/browse/SPARK-17680 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Kazuaki Ishizaki > Fix For: 2.1.0 > > > Spark SQL supports Unicode characters for column names when specified within > backticks(`). When the Hive support is enabled, the version of the Hive > metastore must be higher than 0.12, See the JIRA: > https://issues.apache.org/jira/browse/HIVE-6013 Hive metastore supports > Unicode characters for column names since 0.13. > In Spark SQL, table comments, and view comments always allow Unicode > characters without backticks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18551) Add functionality to delete event logs from the History Server UI
[ https://issues.apache.org/jira/browse/SPARK-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Bozarth resolved SPARK-18551. -- Resolution: Won't Fix Closing this based on complexity and security concerns raised by [~vanzin] > Add functionality to delete event logs from the History Server UI > - > > Key: SPARK-18551 > URL: https://issues.apache.org/jira/browse/SPARK-18551 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Reporter: Alex Bozarth > > Sometimes a Spark user will only have access to a History Server to interact > with their (past) applications. But without access to the server they can > only delete applications through use of the FS Cleaner feature, which itself > can only clean logs older than a set date. > I propose adding the ability to delete specific applications via the History > Server UI with the default setting to off. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18551) Add functionality to delete event logs from the History Server UI
[ https://issues.apache.org/jira/browse/SPARK-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703562#comment-15703562 ] Alex Bozarth commented on SPARK-18551: -- Ok, now I see. I will close this and my PR, and if anyone asks about this in the future I will pass on this answer. This is a very straightforward and smart reason to drop this, thanks. > Add functionality to delete event logs from the History Server UI > - > > Key: SPARK-18551 > URL: https://issues.apache.org/jira/browse/SPARK-18551 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Reporter: Alex Bozarth > > Sometimes a Spark user will only have access to a History Server to interact > with their (past) applications. But without access to the server they can > only delete applications through use of the FS Cleaner feature, which itself > can only clean logs older than a set date. > I propose adding the ability to delete specific applications via the History > Server UI with the default setting to off. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17783) Hide Credentials in CREATE and DESC FORMATTED/EXTENDED a PERSISTENT/TEMP Table for JDBC
[ https://issues.apache.org/jira/browse/SPARK-17783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703558#comment-15703558 ] Apache Spark commented on SPARK-17783: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/16047 > Hide Credentials in CREATE and DESC FORMATTED/EXTENDED a PERSISTENT/TEMP > Table for JDBC > --- > > Key: SPARK-17783 > URL: https://issues.apache.org/jira/browse/SPARK-17783 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Critical > Fix For: 2.1.0 > > > We should never expose the Credentials in the EXPLAIN and DESC > FORMATTED/EXTENDED command. However, below commands exposed the credentials. > {noformat} > CREATE TABLE tab1 USING org.apache.spark.sql.jdbc > {noformat} > {noformat} > == Physical Plan == > ExecutedCommand >+- CreateDataSourceTableCommand CatalogTable( > Table: `tab1` > Created: Tue Oct 04 21:39:44 PDT 2016 > Last Access: Wed Dec 31 15:59:59 PST 1969 > Type: MANAGED > Provider: org.apache.spark.sql.jdbc > Storage(Properties: > [url=jdbc:h2:mem:testdb0;user=testUser;password=testPass, > dbtable=TEST.PEOPLE, user=testUser, password=testPass])), false > {noformat} > {noformat} > DESC FORMATTED tab1 > {noformat} > {noformat} > ... > |# Storage Information | > | | > |Compressed: |No > | | > |Storage Desc Parameters:| > | | > | path > |file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/tab1| | > | url > |jdbc:h2:mem:testdb0;user=testUser;password=testPass | | > | dbtable |TEST.PEOPLE > | | > | user |testUser > | | > | password |testPass > | | > ++--+---+ > {noformat} > {noformat} > DESC EXTENDED tab1 > {noformat} > {noformat} > ... > Storage(Properties: > [path=file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/tab1, > url=jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable=TEST.PEOPLE, > user=testUser, password=testPass]))| | > {noformat} > {noformat} > CREATE TEMP VIEW tab1 USING org.apache.spark.sql.jdbc > {noformat} > {noformat} > == Physical Plan == > ExecutedCommand >+- CreateTempViewUsing `tab1`, false, org.apache.spark.sql.jdbc, Map(url > -> jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable -> > TEST.PEOPLE, user -> testUser, password -> testPass) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18551) Add functionality to delete event logs from the History Server UI
[ https://issues.apache.org/jira/browse/SPARK-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703487#comment-15703487 ] Marcelo Vanzin commented on SPARK-18551: I doubt code will convince me, because my main issue is the complexity being added. Why should we make the SHS more complicated and open all these security discussions in the first place? The scenario you mention is self-inflicted. If the admin wants users to be able to clean up their logs, he should just let the users do so, like in any normal installation. If some product is wrapping Spark and intentionally disabling access to these things, then it's that product's problem to solve the problem it's creating, not Spark's. > Add functionality to delete event logs from the History Server UI > - > > Key: SPARK-18551 > URL: https://issues.apache.org/jira/browse/SPARK-18551 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Reporter: Alex Bozarth > > Sometimes a Spark user will only have access to a History Server to interact > with their (past) applications. But without access to the server they can > only delete applications through use of the FS Cleaner feature, which itself > can only clean logs older than a set date. > I propose adding the ability to delete specific applications via the History > Server UI with the default setting to off. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-16845: -- Component/s: (was: ML) (was: MLlib) SQL > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > - > > Key: SPARK-16845 > URL: https://issues.apache.org/jira/browse/SPARK-16845 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 2.0.0 >Reporter: hejie > Attachments: error.txt.zip > > > I have a wide table(400 columns), when I try fitting the traindata on all > columns, the fatal error occurs. > ... 46 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18320) ML 2.1 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-18320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-18320. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 LGTM; I can't recall other major additions. I also expect us to do some more Scala vs Python checks as we continue migration to spark.ml in the next release or 2. Thank you for checking! I'll mark this as resolved. This will be updated to fix version 2.1.0 once RC2 is cut. > ML 2.1 QA: API: Python API coverage > --- > > Key: SPARK-18320 > URL: https://issues.apache.org/jira/browse/SPARK-18320 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson >Priority: Blocker > Fix For: 2.1.1, 2.2.0 > > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18408) API Improvements for LSH
[ https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703467#comment-15703467 ] Joseph K. Bradley commented on SPARK-18408: --- Note this is marked now as 2.1.1 b/c of RC1 being cut, but it will be changed to 2.1.0 once RC2 is cut. > API Improvements for LSH > > > Key: SPARK-18408 > URL: https://issues.apache.org/jira/browse/SPARK-18408 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yun Ni >Assignee: Yun Ni > Fix For: 2.1.1, 2.2.0 > > > As the first improvements to current LSH Implementations, we are planning to > do the followings: > - Change output schema to {{Array of Vector}} instead of {{Vectors}} > - Use {{numHashTables}} as the dimension of {{Array}} and > {{numHashFunctions}} as the dimension of {{Vector}} > - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, > {{MinHash}} to {{MinHashLSH}} > - Make randUnitVectors/randCoefficients private > - Make Multi-Probe NN Search and {{hashDistance}} private for future > discussion -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
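For orientation, a rough usage sketch of the renamed estimator described above, assuming the Python wrapper carries the same name and parameters (the wrapper itself arrived in a later release; parameter values here are illustrative):
{code}
# Hedged usage sketch of BucketedRandomProjectionLSH after the rename; the
# output column holds an Array of Vectors, one vector per hash table.
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, Vectors.dense([1.0, 1.0])), (1, Vectors.dense([-1.0, -1.0]))],
    ["id", "features"])

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  numHashTables=3, bucketLength=2.0)
model = brp.fit(df)
model.transform(df).show(truncate=False)  # 'hashes' is an array of vectors
{code}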
[jira] [Commented] (SPARK-18551) Add functionality to delete event logs from the History Server UI
[ https://issues.apache.org/jira/browse/SPARK-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703466#comment-15703466 ] Alex Bozarth commented on SPARK-18551: -- I'll take a shot at convincing you with my code: I'll work on some updates with your comments in mind, and if you shoot them down I'll accept it at that. Thanks for all the input; I knew this would be a bit controversial when I opened it, but I hadn't properly considered the security issues. > Add functionality to delete event logs from the History Server UI > - > > Key: SPARK-18551 > URL: https://issues.apache.org/jira/browse/SPARK-18551 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Reporter: Alex Bozarth > > Sometimes a Spark user will only have access to a History Server to interact > with their (past) applications. But without access to the server they can > only delete applications through use of the FS Cleaner feature, which itself > can only clean logs older than a set date. > I propose adding the ability to delete specific applications via the History > Server UI with the default setting to off. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18408) API Improvements for LSH
[ https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18408: -- Fix Version/s: (was: 2.1.0) 2.1.1 > API Improvements for LSH > > > Key: SPARK-18408 > URL: https://issues.apache.org/jira/browse/SPARK-18408 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yun Ni >Assignee: Yun Ni > Fix For: 2.1.1, 2.2.0 > > > As the first improvements to current LSH Implementations, we are planning to > do the followings: > - Change output schema to {{Array of Vector}} instead of {{Vectors}} > - Use {{numHashTables}} as the dimension of {{Array}} and > {{numHashFunctions}} as the dimension of {{Vector}} > - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, > {{MinHash}} to {{MinHashLSH}} > - Make randUnitVectors/randCoefficients private > - Make Multi-Probe NN Search and {{hashDistance}} private for future > discussion -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18408) API Improvements for LSH
[ https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-18408. --- Resolution: Fixed Fix Version/s: 2.1.0 2.2.0 Issue resolved by pull request 15874 [https://github.com/apache/spark/pull/15874] > API Improvements for LSH > > > Key: SPARK-18408 > URL: https://issues.apache.org/jira/browse/SPARK-18408 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yun Ni > Fix For: 2.2.0, 2.1.0 > > > As the first improvements to current LSH Implementations, we are planning to > do the followings: > - Change output schema to {{Array of Vector}} instead of {{Vectors}} > - Use {{numHashTables}} as the dimension of {{Array}} and > {{numHashFunctions}} as the dimension of {{Vector}} > - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, > {{MinHash}} to {{MinHashLSH}} > - Make randUnitVectors/randCoefficients private > - Make Multi-Probe NN Search and {{hashDistance}} private for future > discussion -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18332) SparkR 2.1 QA: Programming guide, migration guide, vignettes updates
[ https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703455#comment-15703455 ] Miao Wang commented on SPARK-18332: --- For some reason, I didn't receive the `CC` notification for this JIRA. I can help on this. > SparkR 2.1 QA: Programming guide, migration guide, vignettes updates > > > Key: SPARK-18332 > URL: https://issues.apache.org/jira/browse/SPARK-18332 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Before the release, we need to update the SparkR Programming Guide, its > migration guide, and the R vignettes. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > * Update R vignettes > Note: This task is for large changes to the guides. New features are handled > in [SPARK-18330]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18408) API Improvements for LSH
[ https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18408: -- Assignee: Yun Ni > API Improvements for LSH > > > Key: SPARK-18408 > URL: https://issues.apache.org/jira/browse/SPARK-18408 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yun Ni >Assignee: Yun Ni > Fix For: 2.1.0, 2.2.0 > > > As the first improvements to current LSH Implementations, we are planning to > do the followings: > - Change output schema to {{Array of Vector}} instead of {{Vectors}} > - Use {{numHashTables}} as the dimension of {{Array}} and > {{numHashFunctions}} as the dimension of {{Vector}} > - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, > {{MinHash}} to {{MinHashLSH}} > - Make randUnitVectors/randCoefficients private > - Make Multi-Probe NN Search and {{hashDistance}} private for future > discussion -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
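Editor's note: for readers tracking the renames above, the sketch below shows roughly what user code looks like against the revised API in a spark-shell session on a 2.1-era build (where {{spark}} is predefined). The class and setter names come from the list in the issue description; the toy data and the "features"/"hashes" column names are illustrative only.
{code}
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors

// Toy input; column names are made up for this sketch.
val df = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 1.0)),
  (1, Vectors.dense(1.0, -1.0)),
  (2, Vectors.dense(-1.0, -1.0))
)).toDF("id", "features")

val lsh = new BucketedRandomProjectionLSH()   // formerly RandomProjection
  .setBucketLength(2.0)
  .setNumHashTables(3)                        // outer Array dimension, per the schema change above
  .setInputCol("features")
  .setOutputCol("hashes")                     // each row gets an Array of Vector

val model = lsh.fit(df)
model.transform(df).show(truncate = false)
{code}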
[jira] [Commented] (SPARK-18558) spark-csv: infer data type for mixed integer/null columns causes exception
[ https://issues.apache.org/jira/browse/SPARK-18558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703424#comment-15703424 ] Miao Wang commented on SPARK-18558: ---
scala> val df = spark.read.option("header", "true").option("inferSchema", "true").format("csv").load("example.csv")
df: org.apache.spark.sql.DataFrame = [column1: int]

scala> df.printSchema
root
 |-- column1: integer (nullable = true)

scala> df.show(5)
+-------+
|column1|
+-------+
|      1|
|      2|
|   null|
+-------+

Same here.
> spark-csv: infer data type for mixed integer/null columns causes exception > -- > > Key: SPARK-18558 > URL: https://issues.apache.org/jira/browse/SPARK-18558 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2 >Reporter: Peter Rose > > Null pointer exception when using the following csv file: > example.csv: > column1 > "1" > "2" > "" > Dataset df = spark > .read() > .option("header", "true") > .option("inferSchema", "true") > .format("csv") > .load(example.csv); > df.printSchema(); > The type is correctly inferred: > root > |-- col1: integer (nullable = true) > df.show(5); > The show method leads to this exception: > java.lang.NumberFormatException: null > at java.lang.Integer.parseInt(Integer.java:542) ~[?:1.8.0_25] > at java.lang.Integer.parseInt(Integer.java:615) ~[?:1.8.0_25] > at > scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272) > ~[scala-library-2.11.8.jar:?] > at scala.collection.immutable.StringOps.toInt(StringOps.scala:29) > ~[scala-library-2.11.8.jar:?] > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:241) > ~[spark-sql_2.11-2.0.2.jar:2.0.2] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
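Editor's note: a workaround sketch, not a fix. Supplying the schema explicitly skips inference altogether; the API calls below exist in 2.0.x, but whether this avoids the reporter's exception on 2.0.2 is an assumption, not something verified in this thread (run in spark-shell against the same example.csv).
{code}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Declare the column (nullable) up front instead of relying on inferSchema.
val schema = StructType(Seq(StructField("column1", IntegerType, nullable = true)))

val df = spark.read
  .option("header", "true")
  .schema(schema)            // no inferSchema pass over the data
  .csv("example.csv")

df.printSchema()
df.show()
{code}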
[jira] [Updated] (SPARK-18527) UDAFPercentile (bigint, array) needs explicity cast to double
[ https://issues.apache.org/jira/browse/SPARK-18527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-18527: -- Assignee: Jiang Xingbo (was: Herman van Hovell) > UDAFPercentile (bigint, array) needs explicity cast to double > - > > Key: SPARK-18527 > URL: https://issues.apache.org/jira/browse/SPARK-18527 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark-2.0.1-bin-hadoop2.7/bin/spark-shell >Reporter: Fabian Boehnlein >Assignee: Jiang Xingbo > Fix For: 2.1.0 > > > Same bug as SPARK-16228 but > {code}_FUNC_(bigint, array) {code} > instead of > {code}_FUNC_(bigint, double){code} > Fix of SPARK-16228 only fixes the non-array case that was hit. > {code} > sql("select percentile(value, array(0.5,0.99)) from values 1,2,3 T(value)") > {code} > fails in Spark 2 shell. > Longer example > {code} > case class Record(key: Long, value: String) > val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i.toLong, > s"val_$i"))) > recordsDF.createOrReplaceTempView("records") > sql("SELECT percentile(key, Array(0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, > 0.2, 0.1)) AS test FROM records") > org.apache.spark.sql.AnalysisException: No handler for Hive UDF > 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': > org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method > for class org.apache.had > oop.hive.ql.udf.UDAFPercentile with (bigint, array). Possible > choices: _FUNC_(bigint, array) _FUNC_(bigint, double) ; line 1 pos 7 > at > org.apache.hadoop.hive.ql.exec.FunctionRegistry.getMethodInternal(FunctionRegistry.java:1164) > at > org.apache.hadoop.hive.ql.exec.DefaultUDAFEvaluatorResolver.getEvaluatorClass(DefaultUDAFEvaluatorResolver.java:83) > at > org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge.getEvaluator(GenericUDAFBridge.java:56) > at > org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver.getEvaluator(AbstractGenericUDAFResolver.java:47){code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18527) UDAFPercentile (bigint, array) needs explicity cast to double
[ https://issues.apache.org/jira/browse/SPARK-18527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-18527. --- Resolution: Fixed Assignee: Herman van Hovell Fix Version/s: 2.1.0 This has been fixes by merging the Percentile UDAF into branch-2.1 > UDAFPercentile (bigint, array) needs explicity cast to double > - > > Key: SPARK-18527 > URL: https://issues.apache.org/jira/browse/SPARK-18527 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark-2.0.1-bin-hadoop2.7/bin/spark-shell >Reporter: Fabian Boehnlein >Assignee: Herman van Hovell > Fix For: 2.1.0 > > > Same bug as SPARK-16228 but > {code}_FUNC_(bigint, array) {code} > instead of > {code}_FUNC_(bigint, double){code} > Fix of SPARK-16228 only fixes the non-array case that was hit. > {code} > sql("select percentile(value, array(0.5,0.99)) from values 1,2,3 T(value)") > {code} > fails in Spark 2 shell. > Longer example > {code} > case class Record(key: Long, value: String) > val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i.toLong, > s"val_$i"))) > recordsDF.createOrReplaceTempView("records") > sql("SELECT percentile(key, Array(0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, > 0.2, 0.1)) AS test FROM records") > org.apache.spark.sql.AnalysisException: No handler for Hive UDF > 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': > org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method > for class org.apache.had > oop.hive.ql.udf.UDAFPercentile with (bigint, array). Possible > choices: _FUNC_(bigint, array) _FUNC_(bigint, double) ; line 1 pos 7 > at > org.apache.hadoop.hive.ql.exec.FunctionRegistry.getMethodInternal(FunctionRegistry.java:1164) > at > org.apache.hadoop.hive.ql.exec.DefaultUDAFEvaluatorResolver.getEvaluatorClass(DefaultUDAFEvaluatorResolver.java:83) > at > org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge.getEvaluator(GenericUDAFBridge.java:56) > at > org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver.getEvaluator(AbstractGenericUDAFResolver.java:47){code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18527) UDAFPercentile (bigint, array) needs explicity cast to double
[ https://issues.apache.org/jira/browse/SPARK-18527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703414#comment-15703414 ] Herman van Hovell edited comment on SPARK-18527 at 11/28/16 10:59 PM: -- This has been fixed by merging the Percentile UDAF into branch-2.1 was (Author: hvanhovell): This has been fixes by merging the Percentile UDAF into branch-2.1 > UDAFPercentile (bigint, array) needs explicity cast to double > - > > Key: SPARK-18527 > URL: https://issues.apache.org/jira/browse/SPARK-18527 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark-2.0.1-bin-hadoop2.7/bin/spark-shell >Reporter: Fabian Boehnlein >Assignee: Herman van Hovell > Fix For: 2.1.0 > > > Same bug as SPARK-16228 but > {code}_FUNC_(bigint, array) {code} > instead of > {code}_FUNC_(bigint, double){code} > Fix of SPARK-16228 only fixes the non-array case that was hit. > {code} > sql("select percentile(value, array(0.5,0.99)) from values 1,2,3 T(value)") > {code} > fails in Spark 2 shell. > Longer example > {code} > case class Record(key: Long, value: String) > val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i.toLong, > s"val_$i"))) > recordsDF.createOrReplaceTempView("records") > sql("SELECT percentile(key, Array(0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, > 0.2, 0.1)) AS test FROM records") > org.apache.spark.sql.AnalysisException: No handler for Hive UDF > 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': > org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method > for class org.apache.had > oop.hive.ql.udf.UDAFPercentile with (bigint, array). Possible > choices: _FUNC_(bigint, array) _FUNC_(bigint, double) ; line 1 pos 7 > at > org.apache.hadoop.hive.ql.exec.FunctionRegistry.getMethodInternal(FunctionRegistry.java:1164) > at > org.apache.hadoop.hive.ql.exec.DefaultUDAFEvaluatorResolver.getEvaluatorClass(DefaultUDAFEvaluatorResolver.java:83) > at > org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge.getEvaluator(GenericUDAFBridge.java:56) > at > org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver.getEvaluator(AbstractGenericUDAFResolver.java:47){code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
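Editor's note: the committed fix replaces the Hive UDAF with Spark's native Percentile implementation in branch-2.1. For the affected 2.0.x releases, the sketch below spells out the explicit cast the issue title alludes to; whether a cast to array<double> is enough for the Hive function resolver is an assumption on my part, not something confirmed in this thread.
{code}
// Hypothetical 2.0.x workaround (run in spark-shell): cast the percentile array so the
// Hive UDAF resolver can match _FUNC_(bigint, array<double>) instead of failing to resolve.
spark.sql(
  """select percentile(value, cast(array(0.5, 0.99) as array<double>))
    |from values 1, 2, 3 T(value)""".stripMargin
).show()
{code}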
[jira] [Comment Edited] (SPARK-18551) Add functionality to delete event logs from the History Server UI
[ https://issues.apache.org/jira/browse/SPARK-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703410#comment-15703410 ] Marcelo Vanzin edited comment on SPARK-18551 at 11/28/16 10:58 PM: --- I'll ignore unsecured services here; as SteveL mentions, it's trivial to do whatever you want if HDFS is not secured. bq. For your first point, my thinking was that the person who started the SHS and enabled the deletion config would be the liable user. That's how your change works. But more often than not, the service will be running as some system user (e.g. "spark") and not as the user who started the service (more often than not some admin). Documenting is the bare minimum, but really when exposing this kind of functionality I expect at least some security to exist around it. bq. I'm not quite sure what you mean about the UI's auth code though If user "alice" runs an application, user "bob" should not be able to delete its logs. That works in HDFS because the directory where the logs are stored has the sticky bit set, and "bob" cannot write to "alice"'s logs. So "bob" cannot delete "alice"'s logs either. But here, without applying the ACLs the application has set up to this feature, you would be allowing that scenario. Yes, I'm talking about the security manager, but it only exists at the application level currently; if you look at its config, it has a bunch of ACL-related configs which are only enforced by the Spark UI (and not by the parent UI of the history server). bq. The user here has no access to the log folder without going through their admin. My feeling about that is that the admin is creating the problem and it's not for the SHS to fix it by creating a bunch of other problems. The user needs access to the file system to write the logs in the first place, so he should have access to the file system to delete the logs if he wants to. I currently think this feature brings more issues than it solves, but you can try to convince me otherwise. At the very, very least it should be disabled by default, which makes it less useful from the get go... was (Author: vanzin): I'll ignore unsecured services here; as SteveL mentions, it's trivial to do whatever you want if HDFS is not secured. bq. For your first point, my thinking was that the person who started the SHS and enabled the deletion config would be the liable user. That's how your change works. But more often than not, the service will be running as some system user (e.g. "spark") and not as the user who started the service (more often than not some admin). Documenting is the bare minimum, but really when exposing this kind of functionality I expect at least some security to exist around it. bq. I'm not quite sure what you mean about the UI's auth code though If user "alice" runs an application, user "bob" should not be able to delete its logs. That works in HDFS because the directory where the logs are stored has the sticky bit set, and "bob" cannot write to "alice"'s logs. So "bob" cannot delete "alice"'s logs either. But here, without application the ACLs the application has set up to this feature, you would be allowing that scenario. Yes, I'm talking about the security manager, but it only exists at the application level currently; if you look at its config, it has a bunch of ACL-related configs which are only enforced by the Spark UI (and not by the parent UI of the history server). bq. The user here has no access to the log folder without going through their admin. 
My feeling about that is that the admin is creating the problem and it's not for the SHS to fix it by creating a bunch of other problems. The user needs access to the file system to write the logs in the first place, so he should have access to the file system to delete the logs if he wants to. I currently think this feature brings more issues than it solves, but you can try to convince me otherwise. At the very, very least it should be disabled by default, which makes it less useful from the get go... > Add functionality to delete event logs from the History Server UI > - > > Key: SPARK-18551 > URL: https://issues.apache.org/jira/browse/SPARK-18551 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Reporter: Alex Bozarth > > Sometimes a Spark user will only have access to a History Server to interact > with their (past) applications. But without access to the server they can > only delete applications through use of the FS Cleaner feature, which itself > can only clean logs older than a set date. > I propose adding the ability to delete specific applications via the History > Server UI with the default setting to off. -- This message was sent
[jira] [Commented] (SPARK-18551) Add functionality to delete event logs from the History Server UI
[ https://issues.apache.org/jira/browse/SPARK-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703410#comment-15703410 ] Marcelo Vanzin commented on SPARK-18551: I'll ignore unsecured services here; as SteveL mentions, it's trivial to do whatever you want if HDFS is not secured. bq. For your first point, my thinking was that the person who started the SHS and enabled the deletion config would be the liable user. That's how your change works. But more often than not, the service will be running as some system user (e.g. "spark") and not as the user who started the service (more often than not some admin). Documenting is the bare minimum, but really when exposing this kind of functionality I expect at least some security to exist around it. bq. I'm not quite sure what you mean about the UI's auth code though If user "alice" runs an application, user "bob" should not be able to delete its logs. That works in HDFS because the directory where the logs are stored has the sticky bit set, and "bob" cannot write to "alice"'s logs. So "bob" cannot delete "alice"'s logs either. But here, without application the ACLs the application has set up to this feature, you would be allowing that scenario. Yes, I'm talking about the security manager, but it only exists at the application level currently; if you look at its config, it has a bunch of ACL-related configs which are only enforced by the Spark UI (and not by the parent UI of the history server). bq. The user here has no access to the log folder without going through their admin. My feeling about that is that the admin is creating the problem and it's not for the SHS to fix it by creating a bunch of other problems. The user needs access to the file system to write the logs in the first place, so he should have access to the file system to delete the logs if he wants to. I currently think this feature brings more issues than it solves, but you can try to convince me otherwise. At the very, very least it should be disabled by default, which makes it less useful from the get go... > Add functionality to delete event logs from the History Server UI > - > > Key: SPARK-18551 > URL: https://issues.apache.org/jira/browse/SPARK-18551 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Reporter: Alex Bozarth > > Sometimes a Spark user will only have access to a History Server to interact > with their (past) applications. But without access to the server they can > only delete applications through use of the FS Cleaner feature, which itself > can only clean logs older than a set date. > I propose adding the ability to delete specific applications via the History > Server UI with the default setting to off. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18613) spark.ml LDA classes should not expose spark.mllib in APIs
[ https://issues.apache.org/jira/browse/SPARK-18613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703396#comment-15703396 ] yuhao yang commented on SPARK-18613: Sounds good to me. Will you send a PR? > spark.ml LDA classes should not expose spark.mllib in APIs > -- > > Key: SPARK-18613 > URL: https://issues.apache.org/jira/browse/SPARK-18613 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > spark.ml.LDAModel exposes dependencies on spark.mllib in 2 methods, but it > should not: > * {{def oldLocalModel: OldLocalLDAModel}} > * {{def getModel: OldLDAModel}} > This task is to deprecate those methods. I recommend creating > {{private[ml]}} versions of the methods which are used internally in order to > avoid deprecation warnings. > Setting target for 2.2, but I'm OK with getting it into 2.1 if we have time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
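Editor's note: a minimal sketch of the deprecation pattern the description recommends, using stand-in types and a stand-in {{example.ml}} package rather than the real spark.ml classes; the method names are illustrative, not Spark's actual internals.
{code}
// Stand-in types; this is not the real spark.ml code.
package example.ml {

  class OldLocalLDAModel // placeholder for the spark.mllib model type being hidden

  abstract class LDAModelSketch {
    // private[ml] twin used internally, so internal callers avoid deprecation warnings.
    private[ml] def getOldLocalModel: OldLocalLDAModel

    // Public accessor kept for compatibility but deprecated because it leaks a
    // spark.mllib type through the spark.ml API surface.
    @deprecated("exposes a spark.mllib type; use the spark.ml API instead", "2.2.0")
    def oldLocalModel: OldLocalLDAModel = getOldLocalModel
  }
}
{code}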
[jira] [Assigned] (SPARK-18582) Whitelist LogicalPlan operators allowed in correlated subqueries
[ https://issues.apache.org/jira/browse/SPARK-18582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18582: Assignee: (was: Apache Spark) > Whitelist LogicalPlan operators allowed in correlated subqueries > > > Key: SPARK-18582 > URL: https://issues.apache.org/jira/browse/SPARK-18582 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > > We want to tighten the code that handles correlated subquery to whitelist > operators that are allowed in it. > The current code in {{def pullOutCorrelatedPredicates}} looks like > {code} > // Simplify the predicates before pulling them out. > val transformed = BooleanSimplification(sub) transformUp { > case f @ Filter(cond, child) => ... > case p @ Project(expressions, child) => ... > case a @ Aggregate(grouping, expressions, child) => ... > case w : Window => ... > case j @ Join(left, _, RightOuter, _) => ... > case j @ Join(left, right, FullOuter, _) => ... > case j @ Join(_, right, jt, _) if !jt.isInstanceOf[InnerLike] => ... > case u: Union => ... > case s: SetOperation => ... > case e: Expand => ... > case l : LocalLimit => ... > case g : GlobalLimit => ... > case s : Sample => ... > case p => > failOnOuterReference(p) > ... > } > {code} > The code disallows operators in a sub plan of an operator hosting correlation > on a case by case basis. As it is today, it only blocks {{Union}}, > {{Intersect}}, {{Except}}, {{Expand}} {{LocalLimit}} {{GlobalLimit}} > {{Sample}} {{FullOuter}} and right table of {{LeftOuter}} (and left table of > {{RightOuter}}). That means any {{LogicalPlan}} operators that are not in the > list above are permitted to be under a correlation point. Is this risky? > There are many (30+ at least from browsing the {{LogicalPlan}} type > hierarchy) operators derived from {{LogicalPlan}} class. > For the case of {{ScalarSubquery}}, it explicitly checks that only > {{SubqueryAlias}} {{Project}} {{Filter}} {{Aggregate}} are allowed > ({{CheckAnalysis.scala}} around line 126-165 in and after {{def > cleanQuery}}). We should whitelist which operators are allowed in correlated > subqueries. At my first glance, we should allow, in addition to the ones > allowed in {{ScalarSubquery}}: {{Join}}, {{Distinct}}, {{Sort}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18582) Whitelist LogicalPlan operators allowed in correlated subqueries
[ https://issues.apache.org/jira/browse/SPARK-18582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703374#comment-15703374 ] Apache Spark commented on SPARK-18582: -- User 'nsyca' has created a pull request for this issue: https://github.com/apache/spark/pull/16046 > Whitelist LogicalPlan operators allowed in correlated subqueries > > > Key: SPARK-18582 > URL: https://issues.apache.org/jira/browse/SPARK-18582 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > > We want to tighten the code that handles correlated subquery to whitelist > operators that are allowed in it. > The current code in {{def pullOutCorrelatedPredicates}} looks like > {code} > // Simplify the predicates before pulling them out. > val transformed = BooleanSimplification(sub) transformUp { > case f @ Filter(cond, child) => ... > case p @ Project(expressions, child) => ... > case a @ Aggregate(grouping, expressions, child) => ... > case w : Window => ... > case j @ Join(left, _, RightOuter, _) => ... > case j @ Join(left, right, FullOuter, _) => ... > case j @ Join(_, right, jt, _) if !jt.isInstanceOf[InnerLike] => ... > case u: Union => ... > case s: SetOperation => ... > case e: Expand => ... > case l : LocalLimit => ... > case g : GlobalLimit => ... > case s : Sample => ... > case p => > failOnOuterReference(p) > ... > } > {code} > The code disallows operators in a sub plan of an operator hosting correlation > on a case by case basis. As it is today, it only blocks {{Union}}, > {{Intersect}}, {{Except}}, {{Expand}} {{LocalLimit}} {{GlobalLimit}} > {{Sample}} {{FullOuter}} and right table of {{LeftOuter}} (and left table of > {{RightOuter}}). That means any {{LogicalPlan}} operators that are not in the > list above are permitted to be under a correlation point. Is this risky? > There are many (30+ at least from browsing the {{LogicalPlan}} type > hierarchy) operators derived from {{LogicalPlan}} class. > For the case of {{ScalarSubquery}}, it explicitly checks that only > {{SubqueryAlias}} {{Project}} {{Filter}} {{Aggregate}} are allowed > ({{CheckAnalysis.scala}} around line 126-165 in and after {{def > cleanQuery}}). We should whitelist which operators are allowed in correlated > subqueries. At my first glance, we should allow, in addition to the ones > allowed in {{ScalarSubquery}}: {{Join}}, {{Distinct}}, {{Sort}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18582) Whitelist LogicalPlan operators allowed in correlated subqueries
[ https://issues.apache.org/jira/browse/SPARK-18582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18582: Assignee: Apache Spark > Whitelist LogicalPlan operators allowed in correlated subqueries > > > Key: SPARK-18582 > URL: https://issues.apache.org/jira/browse/SPARK-18582 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong >Assignee: Apache Spark > > We want to tighten the code that handles correlated subquery to whitelist > operators that are allowed in it. > The current code in {{def pullOutCorrelatedPredicates}} looks like > {code} > // Simplify the predicates before pulling them out. > val transformed = BooleanSimplification(sub) transformUp { > case f @ Filter(cond, child) => ... > case p @ Project(expressions, child) => ... > case a @ Aggregate(grouping, expressions, child) => ... > case w : Window => ... > case j @ Join(left, _, RightOuter, _) => ... > case j @ Join(left, right, FullOuter, _) => ... > case j @ Join(_, right, jt, _) if !jt.isInstanceOf[InnerLike] => ... > case u: Union => ... > case s: SetOperation => ... > case e: Expand => ... > case l : LocalLimit => ... > case g : GlobalLimit => ... > case s : Sample => ... > case p => > failOnOuterReference(p) > ... > } > {code} > The code disallows operators in a sub plan of an operator hosting correlation > on a case by case basis. As it is today, it only blocks {{Union}}, > {{Intersect}}, {{Except}}, {{Expand}} {{LocalLimit}} {{GlobalLimit}} > {{Sample}} {{FullOuter}} and right table of {{LeftOuter}} (and left table of > {{RightOuter}}). That means any {{LogicalPlan}} operators that are not in the > list above are permitted to be under a correlation point. Is this risky? > There are many (30+ at least from browsing the {{LogicalPlan}} type > hierarchy) operators derived from {{LogicalPlan}} class. > For the case of {{ScalarSubquery}}, it explicitly checks that only > {{SubqueryAlias}} {{Project}} {{Filter}} {{Aggregate}} are allowed > ({{CheckAnalysis.scala}} around line 126-165 in and after {{def > cleanQuery}}). We should whitelist which operators are allowed in correlated > subqueries. At my first glance, we should allow, in addition to the ones > allowed in {{ScalarSubquery}}: {{Join}}, {{Distinct}}, {{Sort}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
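Editor's note: a toy model of the whitelisting idea proposed above, with stand-in node classes instead of Catalyst's {{LogicalPlan}} hierarchy; it only illustrates the shape of the check (allow a fixed set of operators under a correlation point, reject everything else), not the actual rule.
{code}
// Stand-in plan nodes, not Catalyst classes.
sealed trait Node
case class Filter(child: Node) extends Node
case class Project(child: Node) extends Node
case class Aggregate(child: Node) extends Node
case class Join(left: Node, right: Node) extends Node
case class Sample(child: Node) extends Node // example of an operator that is NOT whitelisted
case object Leaf extends Node

// Whitelist: only these operators may appear under a correlation point; anything else fails fast.
def checkCorrelatedSubplan(node: Node): Unit = node match {
  case Filter(c)    => checkCorrelatedSubplan(c)
  case Project(c)   => checkCorrelatedSubplan(c)
  case Aggregate(c) => checkCorrelatedSubplan(c)
  case Join(l, r)   => checkCorrelatedSubplan(l); checkCorrelatedSubplan(r)
  case Leaf         => ()
  case other        => sys.error(s"Operator $other is not allowed in a correlated subquery")
}

// checkCorrelatedSubplan(Filter(Leaf))         // passes
// checkCorrelatedSubplan(Filter(Sample(Leaf))) // rejected explicitly rather than silently allowed
{code}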
[jira] [Commented] (SPARK-18506) kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic
[ https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703328#comment-15703328 ] Heji Kim commented on SPARK-18506: -- Hi Cody, I am putting one last ditch effort into getting this to work. Could you send me more details about your test setup. Spark cluster- exact number of ec2 instances with instance type? Is three machines- one separate master and 2 separate nodes? Kafka cluster- exact number of ec2 instances with instance type? Thanks, Heji > kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a > single partition on a multi partition topic > --- > > Key: SPARK-18506 > URL: https://issues.apache.org/jira/browse/SPARK-18506 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 > Environment: Problem occurs both in Hadoop/YARN 2.7.3 and Spark > standalone mode 2.0.2 > with Kafka 0.10.1.0. >Reporter: Heji Kim > > Our team is trying to upgrade to Spark 2.0.2/Kafka > 0.10.1.0/spark-streaming-kafka-0-10_2.11 (v 2.0.2) and we cannot get our > drivers to read all partitions of a single stream when kafka > auto.offset.reset=earliest running on a real cluster(separate VM nodes). > When we run our drivers with auto.offset.reset=latest ingesting from a single > kafka topic with multiple partitions (usually 10 but problem shows up with > only 3 partitions), the driver reads correctly from all partitions. > Unfortunately, we need "earliest" for exactly once semantics. > In the same kafka 0.10.1.0/spark 2.x setup, our legacy driver using > spark-streaming-kafka-0-8_2.11 with the prior setting > auto.offset.reset=smallest runs correctly. > We have tried the following configurations in trying to isolate our problem > but it is only auto.offset.reset=earliest on a "real multi-machine cluster" > which causes this problem. > 1. Ran with spark standalone cluster(4 Debian nodes, 8vCPU/30GB each) > instead of YARN 2.7.3. Single partition read problem persists both cases. > Please note this problem occurs on an actual cluster of separate VM nodes > (but not when our engineer runs in as a cluster on his own Mac.) > 2. Ran with spark 2.1 nightly build for the last 10 days. Problem persists. > 3. Turned off checkpointing. Problem persists with or without checkpointing. > 4. Turned off backpressure. Problem persists with or without backpressure. > 5. Tried both partition.assignment.strategy RangeAssignor and > RoundRobinAssignor. Broken with both. > 6. Tried both LocationStrategies (PreferConsistent/PreferFixed). Broken with > both. > 7. Tried the simplest scala driver that only logs. (Our team uses java.) > Broken with both. > 8. Tried increasing GCE capacity for cluster but already we were highly > overprovisioned for cores and memory. Also tried ramping up executors and > cores. Since driver works with auto.offset.reset=latest, we have ruled out > GCP cloud infrastructure issues. > When we turn on the debug logs, we sometimes see partitions being set to > different offset configuration even though the consumer config correctly > indicates auto.offset.reset=earliest. > {noformat} > 8 DEBUG Resetting offset for partition simple_test-8 to earliest offset. > (org.apache.kafka.clients.consumer.internals.Fetcher) > 9 DEBUG Resetting offset for partition simple_test-9 to latest offset. 
> (org.apache.kafka.clients.consumer.internals.Fetcher) > 8 TRACE Sending ListOffsetRequest > {replica_id=-1,topics=[{topic=simple_test,partitions=[{partition=8,timestamp=-2}]}]} > to broker 10.102.20.12:9092 (id: 12 rack: null) > (org.apache.kafka.clients.consumer.internals.Fetcher) > 9 TRACE Sending ListOffsetRequest > {replica_id=-1,topics=[{topic=simple_test,partitions=[{partition=9,timestamp=-1}]}]} > to broker 10.102.20.13:9092 (id: 13 rack: null) > (org.apache.kafka.clients.consumer.internals.Fetcher) > 8 TRACE Received ListOffsetResponse > {responses=[{topic=simple_test,partition_responses=[{partition=8,error_code=0,timestamp=-1,offset=0}]}]} > from broker 10.102.20.12:9092 (id: 12 rack: null) > (org.apache.kafka.clients.consumer.internals.Fetcher) > 9 TRACE Received ListOffsetResponse > {responses=[{topic=simple_test,partition_responses=[{partition=9,error_code=0,timestamp=-1,offset=66724}]}]} > from broker 10.102.20.13:9092 (id: 13 rack: null) > (org.apache.kafka.clients.consumer.internals.Fetcher) > 8 DEBUG Fetched {timestamp=-1, offset=0} for partition simple_test-8 > (org.apache.kafka.clients.consumer.internals.Fetcher) > 9 DEBUG Fetched {timestamp=-1, offset=66724} for partition simple_test-9 > (org.apache.kafka.clients.consumer.internals.Fetcher) > {noformat} > I've enclosed below the completely stripped dow
[jira] [Commented] (SPARK-17896) Dataset groupByKey + reduceGroups fails with codegen-related exception
[ https://issues.apache.org/jira/browse/SPARK-17896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703320#comment-15703320 ] Andrew Ray commented on SPARK-17896: The given code seems to work in 2.0.2 > Dataset groupByKey + reduceGroups fails with codegen-related exception > -- > > Key: SPARK-17896 > URL: https://issues.apache.org/jira/browse/SPARK-17896 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Databricks, MacOS >Reporter: Adam Breindel > > possible regression: works on 2.0, fails on 2.0.1 > following code raises exception related to wholestage codegen: > case class Zip(city:String, zip:String, state:String) > val z1 = Zip("New York", "1", "NY") > val z2 = Zip("New York", "10001", "NY") > val z3 = Zip("Chicago", "60606", "IL") > val zips = sc.parallelize(Seq(z1, z2, z3)).toDS > zips.groupByKey(_.state).reduceGroups((z1, z2) => Zip("*", z1.zip + " " + > z2.zip, z1.state)).show -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18553) Executor loss may cause TaskSetManager to be leaked
[ https://issues.apache.org/jira/browse/SPARK-18553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703271#comment-15703271 ] Apache Spark commented on SPARK-18553: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/16045 > Executor loss may cause TaskSetManager to be leaked > --- > > Key: SPARK-18553 > URL: https://issues.apache.org/jira/browse/SPARK-18553 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.0, 2.0.0, 2.1.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Fix For: 2.0.3 > > > Due to a bug in TaskSchedulerImpl, the complete sudden loss of an executor > may cause a TaskSetManager to be leaked, causing ShuffleDependencies and > other data structures to be kept alive indefinitely, leading to various types > of resource leaks (including shuffle file leaks). > In a nutshell, the problem is that TaskSchedulerImpl did not maintain its own > mapping from executorId to running task ids, leaving it unable to clean up > taskId to taskSetManager maps when an executor is totally lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
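Editor's note: a simplified sketch of the bookkeeping the description calls for -- an executorId-to-running-task-ids map alongside the taskId-to-TaskSetManager map -- using plain collections rather than the real {{TaskSchedulerImpl}}; names are illustrative.
{code}
import scala.collection.mutable

// Toy scheduler state; a String stands in for the TaskSetManager reference.
class SchedulerStateSketch {
  private val taskIdToTaskSetManager = mutable.HashMap[Long, String]()
  private val executorIdToRunningTaskIds = mutable.HashMap[String, mutable.HashSet[Long]]()

  def onTaskLaunched(taskId: Long, execId: String, tsm: String): Unit = {
    taskIdToTaskSetManager(taskId) = tsm
    executorIdToRunningTaskIds.getOrElseUpdate(execId, mutable.HashSet.empty[Long]) += taskId
  }

  // With the executor -> running-task-ids map, a sudden executor loss can clean up every
  // per-task entry, so the TaskSetManager (and its ShuffleDependencies) becomes collectable.
  def onExecutorLost(execId: String): Unit = {
    executorIdToRunningTaskIds.remove(execId).foreach { taskIds =>
      taskIds.foreach(taskIdToTaskSetManager.remove)
    }
  }
}
{code}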
[jira] [Assigned] (SPARK-18614) Incorrect predicate pushdown from ExistenceJoin
[ https://issues.apache.org/jira/browse/SPARK-18614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18614: Assignee: Apache Spark > Incorrect predicate pushdown from ExistenceJoin > --- > > Key: SPARK-18614 > URL: https://issues.apache.org/jira/browse/SPARK-18614 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong >Assignee: Apache Spark >Priority: Minor > > This is a follow-up work from SPARK-18597 to close a potential incorrect > rewrite in {{PushPredicateThroughJoin}} rule of the Optimizer phase. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18614) Incorrect predicate pushdown from ExistenceJoin
[ https://issues.apache.org/jira/browse/SPARK-18614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703240#comment-15703240 ] Apache Spark commented on SPARK-18614: -- User 'nsyca' has created a pull request for this issue: https://github.com/apache/spark/pull/16044 > Incorrect predicate pushdown from ExistenceJoin > --- > > Key: SPARK-18614 > URL: https://issues.apache.org/jira/browse/SPARK-18614 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong >Priority: Minor > > This is a follow-up work from SPARK-18597 to close a potential incorrect > rewrite in {{PushPredicateThroughJoin}} rule of the Optimizer phase. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18614) Incorrect predicate pushdown from ExistenceJoin
[ https://issues.apache.org/jira/browse/SPARK-18614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18614: Assignee: (was: Apache Spark) > Incorrect predicate pushdown from ExistenceJoin > --- > > Key: SPARK-18614 > URL: https://issues.apache.org/jira/browse/SPARK-18614 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong >Priority: Minor > > This is a follow-up work from SPARK-18597 to close a potential incorrect > rewrite in {{PushPredicateThroughJoin}} rule of the Optimizer phase. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16554) Spark should kill executors when they are blacklisted
[ https://issues.apache.org/jira/browse/SPARK-16554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-16554: - Assignee: (was: Imran Rashid) > Spark should kill executors when they are blacklisted > - > > Key: SPARK-16554 > URL: https://issues.apache.org/jira/browse/SPARK-16554 > Project: Spark > Issue Type: New Feature > Components: Scheduler >Reporter: Imran Rashid > > SPARK-8425 will allow blacklisting faulty executors and nodes. However, > these blacklisted executors will continue to run. This is bad for a few > reasons: > (1) Even if there is faulty-hardware, if the cluster is under-utilized spark > may be able to request another executor on a different node. > (2) If there is a faulty-disk (the most common case of faulty-hardware), the > cluster manager may be able to allocate another executor on the same node, if > it can exclude the bad disk. (Yarn will do this with its disk-health > checker.) > With dynamic allocation, this may seem less critical, as a blacklisted > executor will stop running new tasks and eventually get reclaimed. However, > if there is cached data on those executors, they will not get killed till > {{spark.dynamicAllocation.cachedExecutorIdleTimeout}} expires, which is > (effectively) infinite by default. > Users may not *always* want to kill bad executors, so this must be > configurable to some extent. At a minimum, it should be possible to enable / > disable it; perhaps the executor should be killed after it has been > blacklisted a configurable {{N}} times. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
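Editor's note: a configuration sketch only. {{spark.dynamicAllocation.cachedExecutorIdleTimeout}} is quoted from the description and {{spark.blacklist.enabled}} comes from the SPARK-8425 blacklisting work; the kill switch's property name below is a hypothetical placeholder, since this ticket has not settled on one.
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")                          // blacklisting from SPARK-8425
  .set("spark.blacklist.killBlacklistedExecutors", "true")         // hypothetical name for the proposed on/off switch
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "1h")  // otherwise effectively infinite for cached executors
{code}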
[jira] [Updated] (SPARK-18614) Incorrect predicate pushdown from ExistenceJoin
[ https://issues.apache.org/jira/browse/SPARK-18614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nattavut Sutyanyong updated SPARK-18614: Summary: Incorrect predicate pushdown from ExistenceJoin (was: Incorrect predicate pushdown thru ExistenceJoin) > Incorrect predicate pushdown from ExistenceJoin > --- > > Key: SPARK-18614 > URL: https://issues.apache.org/jira/browse/SPARK-18614 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong >Priority: Minor > > This is a follow-up work from SPARK-18597 to close a potential incorrect > rewrite in {{PushPredicateThroughJoin}} rule of the Optimizer phase. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18614) Incorrect predicate pushdown thru ExistenceJoin
[ https://issues.apache.org/jira/browse/SPARK-18614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703190#comment-15703190 ] Xiao Li commented on SPARK-18614: - Since it does not affect the correctness of the query results, I am removing the `correctness` label. > Incorrect predicate pushdown thru ExistenceJoin > --- > > Key: SPARK-18614 > URL: https://issues.apache.org/jira/browse/SPARK-18614 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong >Priority: Minor > > This is a follow-up work from SPARK-18597 to close a potential incorrect > rewrite in {{PushPredicateThroughJoin}} rule of the Optimizer phase. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18614) Incorrect predicate pushdown thru ExistenceJoin
[ https://issues.apache.org/jira/browse/SPARK-18614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-18614: Labels: (was: correctness) > Incorrect predicate pushdown thru ExistenceJoin > --- > > Key: SPARK-18614 > URL: https://issues.apache.org/jira/browse/SPARK-18614 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong >Priority: Minor > > This is a follow-up work from SPARK-18597 to close a potential incorrect > rewrite in {{PushPredicateThroughJoin}} rule of the Optimizer phase. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18614) Incorrect predicate pushdown thru ExistenceJoin
[ https://issues.apache.org/jira/browse/SPARK-18614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15703185#comment-15703185 ] Nattavut Sutyanyong commented on SPARK-18614: - {{ExistenceJoin}} should be treated the same as {{LeftOuter}} and {{LeftAnti}}, not {{InnerLike}} and {{LeftSemi}}. This is not currently exposed because the rewrite of {{\[NOT\] EXISTS OR ...}} to {{ExistenceJoin}} happens in rule {{RewritePredicateSubquery}}, which is in a separate rule set and placed after the rule {{PushPredicateThroughJoin}}. During the transformation in the rule {{PushPredicateThroughJoin}}, an ExistenceJoin never exists. The semantics of {{ExistenceJoin}} says we need to preserve all the rows from the left table through the join operation as if it is a regular {{LeftOuter}} join. The {{ExistenceJoin}} augments the {{LeftOuter}} operation with a new column called {{exists}}, set to true when the join condition in the ON clause is true and false otherwise. The filter of any rows will happen in the {{Filter}} operation above the {{ExistenceJoin}}. Example: A(c1, c2): \{ (1, 1), (1, 2) \} // B can be any value as it is irrelevant in this example B(c1): \{ (NULL) \} {code:SQL} select A.* from A where exists (select 1 from B where A.c1 = A.c2) or A.c2=2 {code} In this example, the correct result is all the rows from A. If the pattern {{ExistenceJoin}} at line 935 in {{Optimizer.scala}} added by the work in SPARK-18597 is indeed active, the code will push down the predicate A.c1 = A.c2 to be a {{Filter}} on relation A, which will filter the row (1,2) from A. > Incorrect predicate pushdown thru ExistenceJoin > --- > > Key: SPARK-18614 > URL: https://issues.apache.org/jira/browse/SPARK-18614 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong >Priority: Minor > Labels: correctness > > This is a follow-up work from SPARK-18597 to close a potential incorrect > rewrite in {{PushPredicateThroughJoin}} rule of the Optimizer phase. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
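Editor's note: the example above can be checked end to end with the self-contained snippet below, run against a build that accepts EXISTS inside a disjunction (master/branch-2.1 at the time of this thread). The correct answer is both rows of A; a plan that pushes A.c1 = A.c2 down as a Filter on A would drop (1, 2).
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("SPARK-18614-example").getOrCreate()
import spark.implicits._

// A = {(1,1), (1,2)}; B = {(NULL)} -- B's contents are irrelevant, as noted above.
Seq((1, 1), (1, 2)).toDF("c1", "c2").createOrReplaceTempView("A")
spark.sql("select cast(null as int) as c1").createOrReplaceTempView("B")

// Expected: both rows. (1,2) qualifies through "A.c2 = 2" even though A.c1 = A.c2 is false
// for it, so the correlated predicate must not be pushed down onto A.
spark.sql(
  "select A.* from A where exists (select 1 from B where A.c1 = A.c2) or A.c2 = 2"
).show()
{code}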