[jira] [Updated] (SPARK-14194) Spark CSV reader not working properly when CSV content contains a CRLF character (newline) in an intermediate cell
[ https://issues.apache.org/jira/browse/SPARK-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-14194:
--
Affects Version/s: 2.1.0

> Spark CSV reader not working properly when CSV content contains a CRLF character (newline) in an intermediate cell
> ---
>
> Key: SPARK-14194
> URL: https://issues.apache.org/jira/browse/SPARK-14194
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.2, 2.1.0
> Reporter: Kumaresh C R
>
> We have CSV content like the following:
> Sl.NO, Employee_Name, Company, Address, Country, ZIP_Code\n\r
> "1", "ABCD", "XYZ", "1234", "XZ Street \n\r(CRLF character), Municipality,", "USA", "1234567"
> Because there is a '\n\r' sequence in the middle of the row (specifically, in the Address column), the Spark code below creates a dataframe with two rows (excluding the header row), which is wrong. Since the field is enclosed in the quote (") character, why is the embedded character treated as a newline (record separator)? This causes problems when processing the resulting dataframe.
> DataFrame df = sqlContextManager.getSqlContext().read().format("com.databricks.spark.csv")
>     .option("header", "true")
>     .option("inferSchema", "true")
>     .option("delimiter", delim)
>     .option("quote", quote)
>     .option("escape", escape)
>     .load(sourceFile);

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
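For reference, RFC 4180 treats a line break inside a quoted field as part of the field value, not as a record separator. A minimal sketch using Python's standard csv module (not Spark's parser) showing the expected single-data-row result on data shaped like the report above:

```python
import csv
import io

# Sample CSV mirroring the report: a quoted Address field containing a CRLF.
data = (
    'Sl.NO,Employee_Name,Company,Address,Country,ZIP_Code\r\n'
    '"1","ABCD","XYZ","XZ Street \r\nMunicipality,","USA","1234567"\r\n'
)

# newline='' is the csv-module convention so line endings reach the parser untranslated.
rows = list(csv.reader(io.StringIO(data, newline='')))

print(len(rows))   # 2 -> header plus exactly one data row
print(rows[1][3])  # the CRLF stays inside the quoted Address cell
```

The point of the sketch is only the contract: a conforming CSV parser must keep consuming input until the closing quote, so the embedded newline never splits the record.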
[jira] [Created] (SPARK-19646) binaryRecords replicates records in scala API
BahaaEddin AlAila created SPARK-19646:
---
Summary: binaryRecords replicates records in Scala API
Key: SPARK-19646
URL: https://issues.apache.org/jira/browse/SPARK-19646
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.1.0, 2.0.0
Reporter: BahaaEddin AlAila
Priority: Minor

The Scala sc.binaryRecords replicates one record across the entire set. For example, I am trying to load the CIFAR binary data, where, in one big binary file, each 3073-byte record represents a 32x32x3-byte image plus 1 byte for the label. The file resides on my local filesystem. .take(5) returns 5 records that are all identical, and .collect() returns 10,000 records that are all identical. What is puzzling is that the PySpark version works perfectly, even though underneath it calls the Scala implementation. I have tested this on 2.1.0 and 2.0.0.
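The expected contract of a fixed-length binary record reader can be sketched in plain Python (this is not the Spark implementation; the function name is illustrative). One plausible cause of "every record is identical" symptoms in readers like this is reusing a single mutable buffer for all records, so a correct reader must hand out an independent copy per record:

```python
import io

RECORD_LEN = 3073  # 32*32*3 image bytes + 1 label byte (CIFAR binary layout)

def binary_records(stream, record_len):
    """Yield fixed-length records, each as its own bytes object.

    Copying (rather than reusing one buffer) is what keeps records
    distinct after the stream has moved on.
    """
    while True:
        chunk = stream.read(record_len)
        if not chunk:
            break
        if len(chunk) != record_len:
            raise ValueError("truncated trailing record")
        yield bytes(chunk)

# Two distinguishable fake records: all-zero bytes vs all-one bytes.
data = bytes([0]) * RECORD_LEN + bytes([1]) * RECORD_LEN
records = list(binary_records(io.BytesIO(data), RECORD_LEN))

print(len(records))               # 2
print(records[0] == records[1])   # False: records stay distinct
```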
[jira] [Updated] (SPARK-19645) structured streaming job restart bug
[ https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guifeng updated SPARK-19645:
Description:

We are trying to use Structured Streaming in production; however, there is currently a bug in the streaming job restart process. The concrete error message is:

{quote}
Caused by: java.lang.IllegalStateException: Error committing version 2 into HDFSStateStore[id = (op=0, part=136), dir = /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to rename /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
... 14 more
{quote}

The bug can easily be reproduced by restarting a previous streaming job. The main reason is that, on restart, Spark recomputes the WAL offsets and generates the same HDFS delta file (the latest delta file generated before the restart, named "currentBatchId.delta"). In my opinion, this is a bug; if you also consider it a bug, I can fix it.
[jira] [Updated] (SPARK-19645) structured streaming job restart bug

[ https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guifeng updated SPARK-19645:
Summary: structured streaming job restart bug (was: structured streaming job restart)

> structured streaming job restart bug
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.1.0
> Reporter: guifeng
> Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there is currently a bug in the streaming job restart process. The concrete error message is:
> {quote}
> Caused by: java.lang.IllegalStateException: Error committing version 2 into HDFSStateStore[id = (op=0, part=136), dir = /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
> at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
> at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
> at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
> at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
> at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
> at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
> ... 14 more
> {quote}
> The bug can easily be reproduced by restarting a previous streaming job. The main reason is that, on restart, Spark recomputes the WAL offsets and generates the same HDFS delta file, named "currentBatchId.delta". In my opinion, this is a bug; if you also consider it a bug, I can fix it.
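The failure above comes down to a non-idempotent file commit: after a restart, the recomputed batch recreates the same "currentBatchId.delta" name, and the rename of the temp file fails because the destination already exists. The idempotent-commit idea can be sketched in plain Python (this is not the HDFSBackedStateStoreProvider code; names and the exists-means-committed policy are illustrative):

```python
import os
import tempfile
from pathlib import Path

def commit_delta(tmp_path, final_path):
    """Commit a temp state file to its final delta name.

    An existing destination is treated as a batch that was already
    committed before the restart, so the temp file is discarded
    instead of the rename failing.
    """
    if os.path.exists(final_path):
        os.remove(tmp_path)   # already committed by the pre-restart run
        return False
    os.rename(tmp_path, final_path)
    return True

with tempfile.TemporaryDirectory() as d:
    final = os.path.join(d, "2.delta")

    tmp1 = os.path.join(d, "temp-1")
    Path(tmp1).write_text("version 2 state")
    print(commit_delta(tmp1, final))   # True: first commit wins

    # A restarted job recomputes the batch and produces the same delta
    # name; the second commit must be a no-op, not an IOException.
    tmp2 = os.path.join(d, "temp-2")
    Path(tmp2).write_text("version 2 state")
    print(commit_delta(tmp2, final))   # False: already committed, no failure
```

Whether an existing delta file should be skipped or overwritten is a design choice for the real fix; the sketch only shows why a bare rename is not restart-safe.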
[jira] [Created] (SPARK-19645) structured streaming job restart
guifeng created SPARK-19645: --- Summary: structured streaming job restart Key: SPARK-19645 URL: https://issues.apache.org/jira/browse/SPARK-19645 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Reporter: guifeng Priority: Critical We are trying to use Structured Streaming in production; however, there is currently a bug in the streaming job restart process. The concrete error message is: {quote} Caused by: java.lang.IllegalStateException: Error committing version 2 into HDFSStateStore[id = (op=0, part=136), dir = /tmp/PipelinePolicy_112346-continueagg-bxaxs/state/0/136] at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162) at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173) at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123) at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Failed to rename /tmp/PipelinePolicy_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 to
/tmp/PipelinePolicy_112346-continueagg-bxaxs/state/0/136/2.delta at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156) ... 14 more {quote} The bug is easy to reproduce by restarting a previous streaming job: on restart, Spark recomputes the WAL offsets and generates an HDFS delta file with the same name, "currentBatchId.delta", so the rename of the temp file fails. In my opinion this is a bug; if you also consider it a bug, I can fix it. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
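The commit protocol in the stack trace above can be modeled without Spark: the state store provider writes updates to a temp file and then renames it to "<batchId>.delta", and a rename that refuses to overwrite an existing target fails when a restarted job recomputes the same batch id. Below is a minimal Python sketch of that failure mode; the file names and the explicit existence check are illustrative stand-ins for the HDFS rename semantics, not Spark's actual code.

```python
import os
import tempfile

def commit(state_dir, batch_id, temp_name, data):
    """Write updates to a temp file, then rename it to <batch_id>.delta.

    The explicit existence check mirrors HDFS rename semantics, which
    refuse to overwrite an existing file, so committing the same batch
    id twice fails just like in the stack trace above.
    """
    temp_path = os.path.join(state_dir, temp_name)
    target = os.path.join(state_dir, "%d.delta" % batch_id)
    with open(temp_path, "w") as f:
        f.write(data)
    if os.path.exists(target):
        os.remove(temp_path)  # commit fails; clean up the temp file
        return False
    os.rename(temp_path, target)
    return True

state_dir = tempfile.mkdtemp()
first = commit(state_dir, 2, "temp-1", "updates")   # original run: succeeds
second = commit(state_dir, 2, "temp-2", "updates")  # restart recomputes batch 2: fails
```

The second commit fails exactly as reported: the data is the same, only the target name collides, which is why the delta file generated on restart cannot be committed.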
[jira] [Updated] (SPARK-19644) Memory leak in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deenbandhu Agarwal updated SPARK-19644: --- Affects Version/s: (was: 2.0.1) 2.0.2 Environment: 3 AWS EC2 c3.xLarge Number of cores - 3 Number of executors 3 Memory to each executor 2GB was: 3 AWS EC2 c3.xLarge Number of cores - 3 Number of executers 3 Memory to each executor 2GB Labels: memory_leak performance (was: performance) Component/s: (was: Structured Streaming) DStreams > Memory leak in Spark Streaming > -- > > Key: SPARK-19644 > URL: https://issues.apache.org/jira/browse/SPARK-19644 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 > Environment: 3 AWS EC2 c3.xLarge > Number of cores - 3 > Number of executors 3 > Memory to each executor 2GB >Reporter: Deenbandhu Agarwal >Priority: Critical > Labels: memory_leak, performance > Attachments: heapdump.png > > > I am using streaming in production for some aggregation, fetching data > from Cassandra and saving data back to Cassandra. > I see a gradual increase in old generation heap capacity from 1161216 Bytes > to 1397760 Bytes over a period of six hours. > After 50 hours of processing, instances of class > scala.collection.immutable.$colon$colon increased to 12,811,793, which is a > huge number. > I think this is a clear case of a memory leak -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19644) Memory leak in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deenbandhu Agarwal updated SPARK-19644: --- Attachment: heapdump.png Snapshot of heap dump after 50 hours > Memory leak in Spark Streaming > -- > > Key: SPARK-19644 > URL: https://issues.apache.org/jira/browse/SPARK-19644 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.1 > Environment: 3 AWS EC2 c3.xLarge > Number of cores - 3 > Number of executors 3 > Memory to each executor 2GB >Reporter: Deenbandhu Agarwal >Priority: Critical > Labels: performance > Attachments: heapdump.png > > > I am using streaming in production for some aggregation, fetching data > from Cassandra and saving data back to Cassandra. > I see a gradual increase in old generation heap capacity from 1161216 Bytes > to 1397760 Bytes over a period of six hours. > After 50 hours of processing, instances of class > scala.collection.immutable.$colon$colon increased to 12,811,793, which is a > huge number. > I think this is a clear case of a memory leak -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19644) Memory leak in Spark Streaming
Deenbandhu Agarwal created SPARK-19644: -- Summary: Memory leak in Spark Streaming Key: SPARK-19644 URL: https://issues.apache.org/jira/browse/SPARK-19644 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.0.1 Environment: 3 AWS EC2 c3.xLarge Number of cores - 3 Number of executors 3 Memory to each executor 2GB Reporter: Deenbandhu Agarwal Priority: Critical I am using streaming in production for some aggregation, fetching data from Cassandra and saving data back to Cassandra. I see a gradual increase in old generation heap capacity from 1161216 Bytes to 1397760 Bytes over a period of six hours. After 50 hours of processing, instances of class scala.collection.immutable.$colon$colon increased to 12,811,793, which is a huge number. I think this is a clear case of a memory leak -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871202#comment-15871202 ] Takeshi Yamamuro commented on SPARK-19641: -- okay, thanks! > JSON schema inference in DROPMALFORMED mode produces incorrect schema > - > > Key: SPARK-19641 > URL: https://issues.apache.org/jira/browse/SPARK-19641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nathan Howell > > In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no > columns. This occurs when one document contains a valid JSON value (such as a > string or number) and the other documents contain objects or arrays. > When the default case in {{JsonInferSchema.compatibleRootType}} is reached > when merging a {{StringType}} and a {{StructType}} the resulting type will be > a {{StringType}}, which is then discarded because a {{StructType}} is > expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871196#comment-15871196 ] Hyukjin Kwon commented on SPARK-19641: -- Ah, thanks for cc'ing me. I happened to see the related comment. I will leave them here just in case anyone is curious. https://github.com/apache/spark/pull/16386#issuecomment-280525182 https://github.com/NathanHowell/spark/commit/e233fd03346a73b3b447fa4c24f3b12c8b2e53ae > JSON schema inference in DROPMALFORMED mode produces incorrect schema > - > > Key: SPARK-19641 > URL: https://issues.apache.org/jira/browse/SPARK-19641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nathan Howell > > In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no > columns. This occurs when one document contains a valid JSON value (such as a > string or number) and the other documents contain objects or arrays. > When the default case in {{JsonInferSchema.compatibleRootType}} is reached > when merging a {{StringType}} and a {{StructType}} the resulting type will be > a {{StringType}}, which is then discarded because a {{StructType}} is > expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19557) Output parameters are not present in SQL Query Plan
[ https://issues.apache.org/jira/browse/SPARK-19557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19557: --- Assignee: Wenchen Fan > Output parameters are not present in SQL Query Plan > --- > > Key: SPARK-19557 > URL: https://issues.apache.org/jira/browse/SPARK-19557 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Salil Surendran >Assignee: Wenchen Fan > Fix For: 2.2.0 > > > For DataFrameWriter methods like parquet(), json(), csv(), etc., output > parameters are not present in the QueryExecution object. For methods like > saveAsTable() they are. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18120) QueryExecutionListener method doesn't get executed for DataFrameWriter methods
[ https://issues.apache.org/jira/browse/SPARK-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-18120: --- Assignee: Wenchen Fan > QueryExecutionListener method doesn't get executed for DataFrameWriter methods > -- > > Key: SPARK-18120 > URL: https://issues.apache.org/jira/browse/SPARK-18120 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Salil Surendran >Assignee: Wenchen Fan > Fix For: 2.2.0 > > > QueryExecutionListener is a class that has methods named onSuccess() and > onFailure() that get called when a query is executed. Each of those methods > takes a QueryExecution object as a parameter, which can be used for metrics > analysis. It gets called for several of the Dataset methods like take, head, > first, collect, etc., but doesn't get called for any of the DataFrameWriter > methods like saveAsTable, save, etc. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19557) Output parameters are not present in SQL Query Plan
[ https://issues.apache.org/jira/browse/SPARK-19557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19557. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16962 [https://github.com/apache/spark/pull/16962] > Output parameters are not present in SQL Query Plan > --- > > Key: SPARK-19557 > URL: https://issues.apache.org/jira/browse/SPARK-19557 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Salil Surendran > Fix For: 2.2.0 > > > For DataFrameWriter methods like parquet(), json(), csv(), etc., output > parameters are not present in the QueryExecution object. For methods like > saveAsTable() they are. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18120) QueryExecutionListener method doesn't get executed for DataFrameWriter methods
[ https://issues.apache.org/jira/browse/SPARK-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18120. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16962 [https://github.com/apache/spark/pull/16962] > QueryExecutionListener method doesn't get executed for DataFrameWriter methods > -- > > Key: SPARK-18120 > URL: https://issues.apache.org/jira/browse/SPARK-18120 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Salil Surendran > Fix For: 2.2.0 > > > QueryExecutionListener is a class that has methods named onSuccess() and > onFailure() that get called when a query is executed. Each of those methods > takes a QueryExecution object as a parameter, which can be used for metrics > analysis. It gets called for several of the Dataset methods like take, head, > first, collect, etc., but doesn't get called for any of the DataFrameWriter > methods like saveAsTable, save, etc. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-18352: --- Assignee: Nathan Howell > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Nathan Howell > Labels: releasenotes > Fix For: 2.2.0 > > > Spark currently can only parse JSON files that are JSON lines, i.e., each > record occupies exactly one line and records are separated by newlines. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and instead stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18352. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16386 [https://github.com/apache/spark/pull/16386] > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Labels: releasenotes > Fix For: 2.2.0 > > > Spark currently can only parse JSON files that are JSON lines, i.e., each > record occupies exactly one line and records are separated by newlines. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and instead stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19638) Filter pushdown not working for struct fields
[ https://issues.apache.org/jira/browse/SPARK-19638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871162#comment-15871162 ] Takeshi Yamamuro commented on SPARK-19638: -- Filter pushdown depends on the data source, so this behavior is specific to the ES connector. There is no "Pushing down filters..." log entry in Spark itself, so you should ask the connector's author community. > Filter pushdown not working for struct fields > - > > Key: SPARK-19638 > URL: https://issues.apache.org/jira/browse/SPARK-19638 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nick Dimiduk > > Working with a dataset containing struct fields, and enabling debug logging > in the ES connector, I'm seeing the following behavior. The dataframe is > created over the ES connector and then the schema is extended with a couple > column aliases, such as: > {noformat} > df.withColumn("f2", df("foo")) > {noformat} > Queries against those alias columns work as expected for fields that are > non-struct members. > {noformat} > scala> df.withColumn("f2", df("foo")).where("f2 == '1'").limit(0).show > 17/02/16 15:06:49 DEBUG DataSource: Pushing down filters > [IsNotNull(foo),EqualTo(foo,1)] > 17/02/16 15:06:49 TRACE DataSource: Transformed filters into DSL > [{"exists":{"field":"foo"}},{"match":{"foo":"1"}}] > {noformat} > However, try the same with an alias over a struct field, and no filters are > pushed down. > {noformat} > scala> df.withColumn("bar_baz", df("bar.baz")).where("bar_baz == > '1'").limit(1).show > {noformat} > In fact, this is the case even when no alias is used at all. > {noformat} > scala> df.where("bar.baz == '1'").limit(1).show > {noformat} > Basically, pushdown for structs doesn't work at all. > Maybe this is specific to the ES connector? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
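For background on why nested fields behave differently across sources: the filter API that data source connectors receive identifies columns by flat name only, so a predicate over a nested field like bar.baz has no representation there and is never offered to the source. The toy Python model below sketches that translation step; the class and tuple names are illustrative stand-ins, not Spark internals.

```python
class Attr:
    """A top-level column reference, e.g. df("foo")."""
    def __init__(self, name):
        self.name = name

class GetStructField:
    """A nested field access, e.g. df("bar.baz")."""
    def __init__(self, child, field):
        self.child, self.field = child, field

def translate_equal_to(column_expr, value):
    """Translate an EqualTo predicate into a source-level filter tuple.

    The source filter API names columns by flat string only, so a nested
    access has no representation: translation returns None and the
    predicate stays behind as a post-scan filter.
    """
    if isinstance(column_expr, Attr):
        return ("EqualTo", column_expr.name, value)
    return None  # nested field: cannot be pushed down

flat = translate_equal_to(Attr("foo"), "1")                           # pushed down
nested = translate_equal_to(GetStructField(Attr("bar"), "baz"), "1")  # not pushed
```

This matches the reported trace: the flat predicate on foo is handed to the connector, while the predicate on bar.baz never reaches it.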
[jira] [Created] (SPARK-19643) Document how to use Spark/SparkR on Windows
Felix Cheung created SPARK-19643: Summary: Document how to use Spark/SparkR on Windows Key: SPARK-19643 URL: https://issues.apache.org/jira/browse/SPARK-19643 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.0 Reporter: Felix Cheung winutils and/or HADOOP_HOME is required to run Spark on Windows. Since we have a SparkR package that can be run on Windows machines, we should document this in the programming guide and, more importantly, in the vignettes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871152#comment-15871152 ] Takeshi Yamamuro commented on SPARK-19641: -- Could you show us a simple query to reproduce this? cc: [~hyukjin.kwon] > JSON schema inference in DROPMALFORMED mode produces incorrect schema > - > > Key: SPARK-19641 > URL: https://issues.apache.org/jira/browse/SPARK-19641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nathan Howell > > In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no > columns. This occurs when one document contains a valid JSON value (such as a > string or number) and the other documents contain objects or arrays. > When the default case in {{JsonInferSchema.compatibleRootType}} is reached > when merging a {{StringType}} and a {{StructType}} the resulting type will be > a {{StringType}}, which is then discarded because a {{StructType}} is > expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
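While a full reproduction needs a JSON dataset with mixed root types, the mechanism in the issue description can be shown with a toy model. The Python stand-ins below are not Spark's JsonInferSchema; they just demonstrate the described behavior: merging a struct root with a string root through the default case degrades to StringType, which the caller then discards because it expects a struct, leaving zero columns.

```python
STRING = ("StringType",)

def struct(fields):
    """A struct type here is just a tag plus its set of field names."""
    return ("StructType", frozenset(fields))

def compatible_root_type(t1, t2):
    """Simplified stand-in for compatibleRootType: structs merge
    field-wise, while any other combination falls through the default
    case and degrades to StringType."""
    if t1[0] == "StructType" and t2[0] == "StructType":
        return struct(set(t1[1]) | set(t2[1]))
    return STRING  # default case: StringType vs StructType ends up here

docs = [struct({"a"}), STRING]  # one object document, one bare string value
merged = compatible_root_type(*docs)

# The caller keeps only struct roots, so a StringType result means the
# inferred schema ends up with no columns at all.
inferred_fields = merged[1] if merged[0] == "StructType" else frozenset()
```

In DROPMALFORMED mode the object document is then judged against this empty schema, which is how the incorrect column-less result arises.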
[jira] [Assigned] (SPARK-19556) Broadcast data is not encrypted when I/O encryption is on
[ https://issues.apache.org/jira/browse/SPARK-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19556: Assignee: (was: Apache Spark) > Broadcast data is not encrypted when I/O encryption is on > - > > Key: SPARK-19556 > URL: https://issues.apache.org/jira/browse/SPARK-19556 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin > > {{TorrentBroadcast}} uses a couple of "back doors" into the block manager to > write and read data: > {code} > if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, > tellMaster = true)) { > throw new SparkException(s"Failed to store $pieceId of $broadcastId > in local BlockManager") > } > {code} > {code} > bm.getLocalBytes(pieceId) match { > case Some(block) => > blocks(pid) = block > releaseLock(pieceId) > case None => > bm.getRemoteBytes(pieceId) match { > case Some(b) => > if (checksumEnabled) { > val sum = calcChecksum(b.chunks(0)) > if (sum != checksums(pid)) { > throw new SparkException(s"corrupt remote block $pieceId of > $broadcastId:" + > s" $sum != ${checksums(pid)}") > } > } > // We found the block from remote executors/driver's > BlockManager, so put the block > // in this executor's BlockManager. > if (!bm.putBytes(pieceId, b, StorageLevel.MEMORY_AND_DISK_SER, > tellMaster = true)) { > throw new SparkException( > s"Failed to store $pieceId of $broadcastId in local > BlockManager") > } > blocks(pid) = b > case None => > throw new SparkException(s"Failed to get $pieceId of > $broadcastId") > } > } > {code} > The thing these block manager methods have in common is that they bypass the > encryption code; so broadcast data is stored unencrypted in the block > manager, causing unencrypted data to be written to disk if those blocks need > to be evicted from memory. 
> The correct fix here is actually not to change {{TorrentBroadcast}}, but to > fix the block manager so that: > - data stored in memory is not encrypted > - data written to disk is encrypted > This would simplify the code paths that use BlockManager / SerializerManager > APIs (e.g. see SPARK-19520), but requires some tricky changes inside the > BlockManager to still be able to use file channels to avoid reading whole > blocks back into memory so they can be decrypted. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19556) Broadcast data is not encrypted when I/O encryption is on
[ https://issues.apache.org/jira/browse/SPARK-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19556: Assignee: Apache Spark > Broadcast data is not encrypted when I/O encryption is on > - > > Key: SPARK-19556 > URL: https://issues.apache.org/jira/browse/SPARK-19556 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > {{TorrentBroadcast}} uses a couple of "back doors" into the block manager to > write and read data: > {code} > if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, > tellMaster = true)) { > throw new SparkException(s"Failed to store $pieceId of $broadcastId > in local BlockManager") > } > {code} > {code} > bm.getLocalBytes(pieceId) match { > case Some(block) => > blocks(pid) = block > releaseLock(pieceId) > case None => > bm.getRemoteBytes(pieceId) match { > case Some(b) => > if (checksumEnabled) { > val sum = calcChecksum(b.chunks(0)) > if (sum != checksums(pid)) { > throw new SparkException(s"corrupt remote block $pieceId of > $broadcastId:" + > s" $sum != ${checksums(pid)}") > } > } > // We found the block from remote executors/driver's > BlockManager, so put the block > // in this executor's BlockManager. > if (!bm.putBytes(pieceId, b, StorageLevel.MEMORY_AND_DISK_SER, > tellMaster = true)) { > throw new SparkException( > s"Failed to store $pieceId of $broadcastId in local > BlockManager") > } > blocks(pid) = b > case None => > throw new SparkException(s"Failed to get $pieceId of > $broadcastId") > } > } > {code} > The thing these block manager methods have in common is that they bypass the > encryption code; so broadcast data is stored unencrypted in the block > manager, causing unencrypted data to be written to disk if those blocks need > to be evicted from memory. 
> The correct fix here is actually not to change {{TorrentBroadcast}}, but to > fix the block manager so that: > - data stored in memory is not encrypted > - data written to disk is encrypted > This would simplify the code paths that use BlockManager / SerializerManager > APIs (e.g. see SPARK-19520), but requires some tricky changes inside the > BlockManager to still be able to use file channels to avoid reading whole > blocks back into memory so they can be decrypted. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19573) Make NaN/null handling consistent in approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-19573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871114#comment-15871114 ] Apache Spark commented on SPARK-19573: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/16971 > Make NaN/null handling consistent in approxQuantile > --- > > Key: SPARK-19573 > URL: https://issues.apache.org/jira/browse/SPARK-19573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: zhengruifeng > > As discussed in https://github.com/apache/spark/pull/16776, this JIRA is used > to track the following issue: > The multi-column version of approxQuantile drops the rows containing *any* > NaN/null, so the results are not consistent with the outputs of the > single-column version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
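The inconsistency being tracked can be stated concretely: dropping rows that contain *any* NaN/null before aggregating changes the data each individual column sees, compared with the single-column path, which only drops that column's own missing values. A small Python illustration with toy data (not Spark's approxQuantile algorithm):

```python
NAN = float("nan")

def is_missing(x):
    # NaN is the only value unequal to itself; None models SQL null
    return x is None or x != x

rows = [(1.0, 10.0), (2.0, NAN), (3.0, 30.0)]

# Multi-column behavior described above: drop rows with *any* missing value.
kept_rows = [r for r in rows if not any(is_missing(v) for v in r)]
multi_col_first = [r[0] for r in kept_rows]

# Single-column behavior: drop missing values per column, independently.
single_col_first = [r[0] for r in rows if not is_missing(r[0])]

# The first column sees different data in the two code paths, so any
# quantile computed from it can differ as well.
```

The first column loses the valid value 2.0 in the multi-column path only because its neighbor is NaN, which is the discrepancy the pull request addresses.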
[jira] [Assigned] (SPARK-19573) Make NaN/null handling consistent in approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-19573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19573: Assignee: (was: Apache Spark) > Make NaN/null handling consistent in approxQuantile > --- > > Key: SPARK-19573 > URL: https://issues.apache.org/jira/browse/SPARK-19573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: zhengruifeng > > As discussed in https://github.com/apache/spark/pull/16776, this JIRA is used > to track the following issue: > The multi-column version of approxQuantile drops the rows containing *any* > NaN/null, so the results are not consistent with the outputs of the > single-column version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19573) Make NaN/null handling consistent in approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-19573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19573: Assignee: Apache Spark > Make NaN/null handling consistent in approxQuantile > --- > > Key: SPARK-19573 > URL: https://issues.apache.org/jira/browse/SPARK-19573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: zhengruifeng >Assignee: Apache Spark > > As discussed in https://github.com/apache/spark/pull/16776, this JIRA is used > to track the following issue: > The multi-column version of approxQuantile drops the rows containing *any* > NaN/null, so the results are not consistent with the outputs of the > single-column version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19623) Take rows from DataFrame with empty first partition
[ https://issues.apache.org/jira/browse/SPARK-19623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870990#comment-15870990 ] Jaeboo Jung edited comment on SPARK-19623 at 2/17/17 2:15 AM: -- Increasing driver memory can't clear this issue because memory consumption grows proportionally to the number of partitions. For example, 5g of driver memory processes 1000 partitions properly, but an OOME occurs with 3000 partitions. {code} import org.apache.spark.sql._ import org.apache.spark.sql.types._ val rdd = sc.parallelize(1 to 1,3000).map(i => Row.fromSeq(Array.fill(100)(i))) val schema = StructType(for(i <- 1 to 100) yield { StructField("COL"+i,IntegerType, true) }) val rdd2 = rdd.mapPartitionsWithIndex((idx,iter) => if(idx==0 || idx==1) Iterator[Row]() else iter) val df2 = sqlContext.createDataFrame(rdd2,schema) df2.rdd.take(1000) // OK df2.take(1000) // OOME {code} I think this issue comes from a difference in the row-taking process between the RDD and the DataFrame: the RDD takes rows with its internal method, but the DataFrame takes rows through the Limit SQL process. The weird part is that the DataFrame seeks through all the partitions when the first partition is empty. Maybe it is related to SPARK-3211. was (Author: jb jung): Increasing driver memory can't clear this issue because memory consumption grows proportionally to the number of partitions. For example, 5g of driver memory processes 1000 partitions properly but OOME occurs in case of 3000 partitions. 
{code} import org.apache.spark.sql._ import org.apache.spark.sql.types._ val rdd = sc.parallelize(1 to 1,3000).map(i => Row.fromSeq(Array.fill(100)(i))) val schema = StructType(for(i <- 1 to 100) yield { StructField("COL"+i,IntegerType, true) }) val rdd2 = rdd.mapPartitionsWithIndex((idx,iter) => if(idx==0 || idx==1) Iterator[Row]() else iter) val df2 = sqlContext.createDataFrame(rdd2,schema) df2.rdd.take(1000) // OK df2.take(1000) // OOME {code} I think this issue comes from differences of taking rows process between rdd and dataframe. RDD takes rows with its internal method but DataFrame takes rows with Limit SQL process. The weird part is dataframe scanning all the partitions when the first partition is empty. Maybe it is related to SPARK-3211. > Take rows from DataFrame with empty first partition > --- > > Key: SPARK-19623 > URL: https://issues.apache.org/jira/browse/SPARK-19623 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2 >Reporter: Jaeboo Jung >Priority: Minor > > I use Spark 1.6.2 with 1 master and 6 workers. Assuming we have a DataFrame > whose first partition is empty, the DataFrame and its RDD behave differently > when taking rows from it. If we take only 1000 rows from the DataFrame, it > causes an OOME, but the RDD is OK. > In detail, > DataFrame without an empty first partition => OK > DataFrame with an empty first partition => OOME > RDD of DataFrame with an empty first partition => OK > The code below reproduces this error. 
> {code} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val rdd = sc.parallelize(1 to 1,1000).map(i => > Row.fromSeq(Array.fill(100)(i))) > val schema = StructType(for(i <- 1 to 100) yield { > StructField("COL"+i,IntegerType, true) > }) > val rdd2 = rdd.mapPartitionsWithIndex((idx,iter) => if(idx==0 || idx==1) > Iterator[Row]() else iter) > val df1 = sqlContext.createDataFrame(rdd,schema) > df1.take(1000) // OK > val df2 = sqlContext.createDataFrame(rdd2,schema) > df2.rdd.take(1000) // OK > df2.take(1000) // OOME > {code} > I tested it on Spark 1.6.2 with 2gb of driver memory and 5gb of executor > memory. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
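For contrast with the DataFrame Limit path, the incremental scan that RDD.take performs can be modeled in a few lines of plain Scala (a simplified sketch, not Spark's actual scheduler code; all names here are illustrative):

```scala
// Simplified model of why RDD.take tolerates empty leading partitions:
// it scans partitions one at a time and stops once it has n rows,
// instead of planning a global Limit over the whole dataset.
object TakeModel {
  def incrementalTake[T](partitions: Seq[Seq[T]], n: Int): List[T] = {
    val out = scala.collection.mutable.ListBuffer.empty[T]
    val it = partitions.iterator
    // Pull from the next partition only while we still need rows
    while (out.size < n && it.hasNext) out ++= it.next().take(n - out.size)
    out.toList
  }

  def main(args: Array[String]): Unit = {
    // 3000 partitions, the first two empty, mirroring the reproduction above
    val parts = Seq(Seq.empty[Int], Seq.empty[Int]) ++ (1 to 2998).map(i => Seq(i))
    println(incrementalTake(parts, 5)) // prints List(1, 2, 3, 4, 5)
  }
}
```

Under this model only the first handful of partitions are ever touched, which is consistent with df2.rdd.take(1000) succeeding while the SQL Limit path exhausts driver memory.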
[jira] [Updated] (SPARK-19642) Improve the security guarantee for rest api and ui
[ https://issues.apache.org/jira/browse/SPARK-19642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Genmao Yu updated SPARK-19642: -- Summary: Improve the security guarantee for rest api and ui (was: Improve the security guarantee for rest api) > Improve the security guarantee for rest api and ui > -- > > Key: SPARK-19642 > URL: https://issues.apache.org/jira/browse/SPARK-19642 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Priority: Critical > > As Spark gets more and more features, data may start leaking through other > places (e.g. SQL query plans which are shown in the UI). Also current rest > api may be a security hole. Open this JIRA to research and address the > potential security flaws. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19642) Improve the security guarantee for rest api
[ https://issues.apache.org/jira/browse/SPARK-19642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Genmao Yu updated SPARK-19642: -- Description: As Spark gets more and more features, data may start leaking through other places (e.g. SQL query plans which are shown in the UI). Also current rest api may be a security hole. Open this JIRA to research and address the potential security flaws. (was: As Spark gets more and more features, data may start leaking through other places (e.g. SQL query plans which are shown in the UI). Also current rest api may be a security hole. Opening this JIRA to research and address the potential security flaws.) > Improve the security guarantee for rest api > --- > > Key: SPARK-19642 > URL: https://issues.apache.org/jira/browse/SPARK-19642 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Priority: Critical > > As Spark gets more and more features, data may start leaking through other > places (e.g. SQL query plans which are shown in the UI). Also current rest > api may be a security hole. Open this JIRA to research and address the > potential security flaws. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19642) Improve the security guarantee for rest api
[ https://issues.apache.org/jira/browse/SPARK-19642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Genmao Yu updated SPARK-19642: -- Description: As Spark gets more and more features, data may start leaking through other places (e.g. SQL query plans which are shown in the UI). Also current rest api may be a security hole. Opening this JIRA to research and address the potential security flaws. (was: As Spark gets more and more features, data may start leaking through other places (e.g. SQL query plans which are shown in the UI). And current rest api may be a security hole. Opening this JIRA to research and address the potential security flaws.) > Improve the security guarantee for rest api > --- > > Key: SPARK-19642 > URL: https://issues.apache.org/jira/browse/SPARK-19642 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Priority: Critical > > As Spark gets more and more features, data may start leaking through other > places (e.g. SQL query plans which are shown in the UI). Also current rest > api may be a security hole. Opening this JIRA to research and address the > potential security flaws. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19642) Improve the security guarantee for rest api
[ https://issues.apache.org/jira/browse/SPARK-19642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871058#comment-15871058 ] Genmao Yu commented on SPARK-19642: --- cc [~ajbozarth], [~vanzin] and [~srowen] > Improve the security guarantee for rest api > --- > > Key: SPARK-19642 > URL: https://issues.apache.org/jira/browse/SPARK-19642 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Priority: Critical > > As Spark gets more and more features, data may start leaking through other > places (e.g. SQL query plans which are shown in the UI). And current rest api > may be a security hole. Opening this JIRA to research and address the > potential security flaws. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19642) Improve the security guarantee for rest api
Genmao Yu created SPARK-19642: - Summary: Improve the security guarantee for rest api Key: SPARK-19642 URL: https://issues.apache.org/jira/browse/SPARK-19642 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.0, 2.0.2 Reporter: Genmao Yu Priority: Critical As Spark gets more and more features, data may start leaking through other places (e.g. SQL query plans which are shown in the UI). And current rest api may be a security hole. Opening this JIRA to research and address the potential security flaws. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19625) Authorization Support(on all operations not only DDL) in Spark Sql version 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-19625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871035#comment-15871035 ] 翟玉勇 commented on SPARK-19625: - For Spark 1.5.2 there are https://github.com/apache/spark/pull/10144 and https://github.com/apache/spark/pull/11045, but they do not work with 2.1.0. Could you share your code? zhaiyuy...@126.com, thanks. > Authorization Support(on all operations not only DDL) in Spark Sql version > 2.1.0 > > > Key: SPARK-19625 > URL: https://issues.apache.org/jira/browse/SPARK-19625 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: 翟玉勇 > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema
Nathan Howell created SPARK-19641: - Summary: JSON schema inference in DROPMALFORMED mode produces incorrect schema Key: SPARK-19641 URL: https://issues.apache.org/jira/browse/SPARK-19641 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Nathan Howell In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no columns. This occurs when one document contains a valid JSON value (such as a string or number) and the other documents contain objects or arrays. When the default case in {{JsonInferSchema.compatibleRootType}} is reached while merging a {{StringType}} and a {{StructType}}, the resulting type is a {{StringType}}, which is then discarded because a {{StructType}} is expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
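The merge behavior described above can be sketched with a tiny stand-in for the type lattice (a hypothetical model for illustration, not the actual JsonInferSchema code):

```scala
// Minimal model of root-type merging during JSON schema inference:
// two structs merge field-wise, but any other combination falls through
// to the default case and widens to StringType.
object RootTypeModel {
  sealed trait DataType
  case object StringType extends DataType
  case class StructType(fields: Set[String]) extends DataType

  def compatibleRootType(a: DataType, b: DataType): DataType = (a, b) match {
    case (StructType(f1), StructType(f2)) => StructType(f1 ++ f2)
    case _                                => StringType // default: widen to string
  }

  def main(args: Array[String]): Unit = {
    val merged = compatibleRootType(StructType(Set("a")), StringType)
    // A caller that expects a StructType root then discards this result,
    // which is how the inferred schema ends up with no columns.
    println(merged) // prints StringType
  }
}
```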
[jira] [Commented] (SPARK-19497) dropDuplicates with watermark
[ https://issues.apache.org/jira/browse/SPARK-19497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871016#comment-15871016 ] Shixiong Zhu commented on SPARK-19497: -- [~samelamin] Thanks! I just submitted a PR. Could you help review it, please? > dropDuplicates with watermark > - > > Key: SPARK-19497 > URL: https://issues.apache.org/jira/browse/SPARK-19497 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Michael Armbrust >Assignee: Shixiong Zhu >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19640) Incorrect documentation for MLlib CountVectorizerModel for spark 1.5.2
[ https://issues.apache.org/jira/browse/SPARK-19640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871015#comment-15871015 ] yuhao yang commented on SPARK-19640: Thanks for reporting the issue. Feel free to send a fix if you would like to. > Incorrect documentation for MLlib CountVectorizerModel for spark 1.5.2 > -- > > Key: SPARK-19640 > URL: https://issues.apache.org/jira/browse/SPARK-19640 > Project: Spark > Issue Type: Task > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Stephen Kinser >Priority: Trivial > Labels: documentation, easyfix > Fix For: 1.5.2 > > Original Estimate: 0h > Remaining Estimate: 0h > > Spark MLLib documentation for CountVectorizerModel in spark 1.5.2 currently > uses import statement of package path that does not exist > import org.apache.spark.ml.feature.CountVectorizer > import org.apache.spark.mllib.util.CountVectorizerModel > this should be revised to what it is like in spark 1.6+ > import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19497) dropDuplicates with watermark
[ https://issues.apache.org/jira/browse/SPARK-19497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19497: Assignee: Shixiong Zhu (was: Apache Spark) > dropDuplicates with watermark > - > > Key: SPARK-19497 > URL: https://issues.apache.org/jira/browse/SPARK-19497 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Michael Armbrust >Assignee: Shixiong Zhu >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19497) dropDuplicates with watermark
[ https://issues.apache.org/jira/browse/SPARK-19497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871010#comment-15871010 ] Apache Spark commented on SPARK-19497: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16970 > dropDuplicates with watermark > - > > Key: SPARK-19497 > URL: https://issues.apache.org/jira/browse/SPARK-19497 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Michael Armbrust >Assignee: Shixiong Zhu >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19497) dropDuplicates with watermark
[ https://issues.apache.org/jira/browse/SPARK-19497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19497: Assignee: Apache Spark (was: Shixiong Zhu) > dropDuplicates with watermark > - > > Key: SPARK-19497 > URL: https://issues.apache.org/jira/browse/SPARK-19497 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Michael Armbrust >Assignee: Apache Spark >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19640) Incorrect documentation for MLlib CountVectorizerModel for spark 1.5.2
[ https://issues.apache.org/jira/browse/SPARK-19640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871000#comment-15871000 ] Stephen Kinser commented on SPARK-19640: [~yuhaoyan] I saw that you were the one who originally wrote the documentation from SPARK-9890 , should you make the change or should I? > Incorrect documentation for MLlib CountVectorizerModel for spark 1.5.2 > -- > > Key: SPARK-19640 > URL: https://issues.apache.org/jira/browse/SPARK-19640 > Project: Spark > Issue Type: Task > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Stephen Kinser >Priority: Trivial > Labels: documentation, easyfix > Fix For: 1.5.2 > > Original Estimate: 0h > Remaining Estimate: 0h > > Spark MLLib documentation for CountVectorizerModel in spark 1.5.2 currently > uses import statement of package path that does not exist > import org.apache.spark.ml.feature.CountVectorizer > import org.apache.spark.mllib.util.CountVectorizerModel > this should be revised to what it is like in spark 1.6+ > import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19640) Incorrect documentation for MLlib CountVectorizerModel for spark 1.5.2
Stephen Kinser created SPARK-19640: -- Summary: Incorrect documentation for MLlib CountVectorizerModel for spark 1.5.2 Key: SPARK-19640 URL: https://issues.apache.org/jira/browse/SPARK-19640 Project: Spark Issue Type: Task Components: MLlib Affects Versions: 1.5.2 Reporter: Stephen Kinser Priority: Trivial Fix For: 1.5.2 The Spark MLlib documentation for CountVectorizerModel in Spark 1.5.2 currently uses an import statement with a package path that does not exist:
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.mllib.util.CountVectorizerModel
This should be revised to match Spark 1.6+:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19623) Take rows from DataFrame with empty first partition
[ https://issues.apache.org/jira/browse/SPARK-19623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870990#comment-15870990 ] Jaeboo Jung commented on SPARK-19623: - Increasing driver memory does not resolve this issue, because memory consumption grows in proportion to the number of partitions. For example, 5g of driver memory handles 1000 partitions properly, but an OOME occurs with 3000 partitions.
{code}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val rdd = sc.parallelize(1 to 1,3000).map(i => Row.fromSeq(Array.fill(100)(i)))
val schema = StructType(for(i <- 1 to 100) yield {
  StructField("COL"+i, IntegerType, true)
})
val rdd2 = rdd.mapPartitionsWithIndex((idx,iter) => if(idx==0 || idx==1) Iterator[Row]() else iter)
val df2 = sqlContext.createDataFrame(rdd2,schema)
df2.rdd.take(1000) // OK
df2.take(1000) // OOME
{code}
I think this issue comes from a difference in how rows are taken from an RDD versus a DataFrame: the RDD takes rows with its internal incremental method, while the DataFrame takes rows through a SQL Limit plan. The odd part is that the DataFrame scans all the partitions when the first partition is empty. Maybe it is related to SPARK-3211.
> Take rows from DataFrame with empty first partition
> ---
>
> Key: SPARK-19623
> URL: https://issues.apache.org/jira/browse/SPARK-19623
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Jaeboo Jung
>Priority: Minor
>
> I use Spark 1.6.2 with 1 master and 6 workers. Assuming we have a DataFrame with an empty first partition, the DataFrame and its RDD behave differently when taking rows. If we take only 1000 rows from the DataFrame, it causes an OOME, but the RDD is OK.
> In detail,
> DataFrame without an empty first partition => OK
> DataFrame with an empty first partition => OOME
> RDD of DataFrame with an empty first partition => OK
> The code below reproduces this error.
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> val rdd = sc.parallelize(1 to 1,1000).map(i => Row.fromSeq(Array.fill(100)(i)))
> val schema = StructType(for(i <- 1 to 100) yield {
>   StructField("COL"+i, IntegerType, true)
> })
> val rdd2 = rdd.mapPartitionsWithIndex((idx,iter) => if(idx==0 || idx==1) Iterator[Row]() else iter)
> val df1 = sqlContext.createDataFrame(rdd,schema)
> df1.take(1000) // OK
> val df2 = sqlContext.createDataFrame(rdd2,schema)
> df2.rdd.take(1000) // OK
> df2.take(1000) // OOME
> {code}
> I tested it on Spark 1.6.2 with 2gb of driver memory and 5gb of executor memory.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19639) Add spark.svmLinear example and update vignettes
[ https://issues.apache.org/jira/browse/SPARK-19639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19639: Assignee: Apache Spark > Add spark.svmLinear example and update vignettes > > > Key: SPARK-19639 > URL: https://issues.apache.org/jira/browse/SPARK-19639 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Miao Wang >Assignee: Apache Spark > > We recently add the spark.svmLinear API for SparkR. We need to add an example > and update the vignettes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19639) Add spark.svmLinear example and update vignettes
[ https://issues.apache.org/jira/browse/SPARK-19639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19639: Assignee: (was: Apache Spark) > Add spark.svmLinear example and update vignettes > > > Key: SPARK-19639 > URL: https://issues.apache.org/jira/browse/SPARK-19639 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Miao Wang > > We recently add the spark.svmLinear API for SparkR. We need to add an example > and update the vignettes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19639) Add spark.svmLinear example and update vignettes
[ https://issues.apache.org/jira/browse/SPARK-19639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870918#comment-15870918 ] Apache Spark commented on SPARK-19639: -- User 'wangmiao1981' has created a pull request for this issue: https://github.com/apache/spark/pull/16969 > Add spark.svmLinear example and update vignettes > > > Key: SPARK-19639 > URL: https://issues.apache.org/jira/browse/SPARK-19639 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Miao Wang > > We recently add the spark.svmLinear API for SparkR. We need to add an example > and update the vignettes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19639) Add spark.svmLinear example and update vignettes
Miao Wang created SPARK-19639: - Summary: Add spark.svmLinear example and update vignettes Key: SPARK-19639 URL: https://issues.apache.org/jira/browse/SPARK-19639 Project: Spark Issue Type: Documentation Components: SparkR Affects Versions: 2.2.0 Reporter: Miao Wang We recently added the spark.svmLinear API for SparkR. We need to add an example and update the vignettes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19638) Filter pushdown not working for struct fields
Nick Dimiduk created SPARK-19638: Summary: Filter pushdown not working for struct fields Key: SPARK-19638 URL: https://issues.apache.org/jira/browse/SPARK-19638 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Nick Dimiduk Working with a dataset containing struct fields, and with debug logging enabled in the ES connector, I'm seeing the following behavior. The dataframe is created over the ES connector and the schema is then extended with a couple of column aliases, such as:
{noformat}
df.withColumn("f2", df("foo"))
{noformat}
Queries against those aliased columns work as expected for fields that are non-struct members.
{noformat}
scala> df.withColumn("f2", df("foo")).where("f2 == '1'").limit(0).show
17/02/16 15:06:49 DEBUG DataSource: Pushing down filters [IsNotNull(foo),EqualTo(foo,1)]
17/02/16 15:06:49 TRACE DataSource: Transformed filters into DSL [{"exists":{"field":"foo"}},{"match":{"foo":"1"}}]
{noformat}
However, try the same with an alias over a struct field, and no filters are pushed down.
{noformat}
scala> df.withColumn("bar_baz", df("bar.baz")).where("bar_baz == '1'").limit(1).show
{noformat}
In fact, this is the case even when no alias is used at all.
{noformat}
scala> df.where("bar.baz == '1'").limit(1).show
{noformat}
Basically, pushdown for structs doesn't work at all. Maybe this is specific to the ES connector? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
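The reported behavior is what one would see if predicate translation only handles top-level attributes. A toy sketch of such a translation step (hypothetical types and names, for illustration only, not Spark's or the ES connector's actual code):

```scala
// Toy model of translating a predicate into a data-source filter:
// only equality on a plain top-level attribute is translated; anything
// involving a struct-field access is dropped (i.e. not pushed down).
object PushdownModel {
  sealed trait Expr
  case class Attr(name: String) extends Expr                   // top-level column
  case class GetField(child: Expr, field: String) extends Expr // struct access
  case class EqualTo(left: Expr, right: String) extends Expr

  def translate(e: Expr): Option[String] = e match {
    case EqualTo(Attr(n), v) => Some(s"$n = '$v'")
    case _                   => None // nested fields fall through: no pushdown
  }

  def main(args: Array[String]): Unit = {
    println(translate(EqualTo(Attr("foo"), "1")))                  // prints Some(foo = '1')
    println(translate(EqualTo(GetField(Attr("bar"), "baz"), "1"))) // prints None
  }
}
```

Under this model, "foo == '1'" produces a pushable filter while "bar.baz == '1'" produces none, matching the debug logs in the report.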
[jira] [Created] (SPARK-19637) add to_json APIs to SQL
Burak Yavuz created SPARK-19637: --- Summary: add to_json APIs to SQL Key: SPARK-19637 URL: https://issues.apache.org/jira/browse/SPARK-19637 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.1.0 Reporter: Burak Yavuz The "to_json" function is useful for turning a struct into a JSON string. It currently doesn't work in SQL, but adding it should be trivial. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
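What to_json does to a struct can be illustrated with a minimal plain-Scala sketch (an over-simplified stand-in with hypothetical names; it handles only flat string/number fields, unlike the real function):

```scala
// Minimal illustration of struct-to-JSON serialization: render
// (field name, value) pairs as a JSON object string.
object ToJsonSketch {
  def toJson(struct: Seq[(String, Any)]): String =
    struct.map {
      case (k, v: String) => s"\"$k\":\"$v\"" // strings are quoted
      case (k, v)         => s"\"$k\":$v"     // numbers etc. are rendered bare
    }.mkString("{", ",", "}")

  def main(args: Array[String]): Unit = {
    println(toJson(Seq("a" -> 1, "b" -> "x"))) // prints {"a":1,"b":"x"}
  }
}
```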
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870825#comment-15870825 ] Miao Wang commented on SPARK-19634: --- I can give it a try. Thanks! Miao > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14658) when executor lost DagScheduer may submit one stage twice even if the first running taskset for this stage is not finished
[ https://issues.apache.org/jira/browse/SPARK-14658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870816#comment-15870816 ] Josh Rosen commented on SPARK-14658: Here are the logs from my reproduction, excerpted down to only the relevant parts (as near as I can tell): First attempt of the task set being submitted:
{code}
17/02/13 20:11:59 INFO DAGScheduler: waiting: Set(ShuffleMapStage 3086, ResultStage 3087)
17/02/13 20:11:59 INFO DAGScheduler: failed: Set()
17/02/13 20:11:59 INFO DAGScheduler: Submitting ShuffleMapStage 3086 (MapPartitionsRDD[34696] at cache at :61), which has no missing parents
17/02/13 20:11:59 INFO MemoryStore: Block broadcast_2871 stored as values in memory (estimated size 67.1 KB, free 9.1 GB)
17/02/13 20:11:59 INFO MemoryStore: Block broadcast_2871_piece0 stored as bytes in memory (estimated size 28.1 KB, free 9.1 GB)
17/02/13 20:11:59 INFO BlockManagerInfo: Added broadcast_2871_piece0 in memory on :45333 (size: 28.1 KB, free: 10.6 GB)
17/02/13 20:11:59 INFO SparkContext: Created broadcast 2871 from broadcast at DAGScheduler.scala:996
17/02/13 20:11:59 INFO DAGScheduler: Submitting 2213 missing tasks from ShuffleMapStage 3086 (MapPartitionsRDD[34696] at cache at :61)
17/02/13 20:11:59 INFO TaskSchedulerImpl: Adding task set 3086.0 with 2213 tasks
17/02/13 20:11:59 INFO FairSchedulableBuilder: Added task set TaskSet_3086.0 tasks to pool 1969095006217179029
{code}
While the task set was running, some tasks failed due to fetch failures from the parent stage, causing both the stage with the fetch failures and the parent stage to be resubmitted:
{code}
17/02/13 20:44:34 WARN TaskSetManager: Lost task 1213.0 in stage 3086.0 (TID 370751, , executor 622): FetchFailed(null, shuffleId=638, mapId=-1, reduceId=0, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 638
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:693)
    at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:147)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
    at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at
[jira] [Assigned] (SPARK-19337) Documentation and examples for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-19337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19337: Assignee: Apache Spark > Documentation and examples for LinearSVC > > > Key: SPARK-19337 > URL: https://issues.apache.org/jira/browse/SPARK-19337 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > User guide + example code for LinearSVC -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19337) Documentation and examples for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-19337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870813#comment-15870813 ] Apache Spark commented on SPARK-19337: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/16968 > Documentation and examples for LinearSVC > > > Key: SPARK-19337 > URL: https://issues.apache.org/jira/browse/SPARK-19337 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley > > User guide + example code for LinearSVC -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19337) Documentation and examples for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-19337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19337: Assignee: (was: Apache Spark) > Documentation and examples for LinearSVC > > > Key: SPARK-19337 > URL: https://issues.apache.org/jira/browse/SPARK-19337 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley > > User guide + example code for LinearSVC -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18409) LSH approxNearestNeighbors should use approxQuantile instead of sort
[ https://issues.apache.org/jira/browse/SPARK-18409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18409: Assignee: Apache Spark > LSH approxNearestNeighbors should use approxQuantile instead of sort > > > Key: SPARK-18409 > URL: https://issues.apache.org/jira/browse/SPARK-18409 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > LSHModel.approxNearestNeighbors sorts the full dataset on the hashDistance in > order to find a threshold. It should use approxQuantile instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18409) LSH approxNearestNeighbors should use approxQuantile instead of sort
[ https://issues.apache.org/jira/browse/SPARK-18409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870791#comment-15870791 ] Apache Spark commented on SPARK-18409: -- User 'Yunni' has created a pull request for this issue: https://github.com/apache/spark/pull/16966 > LSH approxNearestNeighbors should use approxQuantile instead of sort > > > Key: SPARK-18409 > URL: https://issues.apache.org/jira/browse/SPARK-18409 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > LSHModel.approxNearestNeighbors sorts the full dataset on the hashDistance in > order to find a threshold. It should use approxQuantile instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18409) LSH approxNearestNeighbors should use approxQuantile instead of sort
[ https://issues.apache.org/jira/browse/SPARK-18409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18409: Assignee: (was: Apache Spark) > LSH approxNearestNeighbors should use approxQuantile instead of sort > > > Key: SPARK-18409 > URL: https://issues.apache.org/jira/browse/SPARK-18409 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > LSHModel.approxNearestNeighbors sorts the full dataset on the hashDistance in > order to find a threshold. It should use approxQuantile instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
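Editor's note: the gain SPARK-18409 is after can be sketched in plain Python, independent of Spark's API. Instead of sorting the full distance column to find the k-th nearest threshold, sort only a random sample and read off the corresponding quantile (function and parameter names here are illustrative, not Spark's):

```python
import random

def approx_quantile(values, q, sample_size=1000, seed=0):
    """Approximate the q-quantile by sorting a random sample instead of
    the full dataset -- the idea behind replacing the global sort on
    hashDistance with approxQuantile in approxNearestNeighbors."""
    rng = random.Random(seed)
    sample = values if len(values) <= sample_size else rng.sample(values, sample_size)
    s = sorted(sample)
    idx = min(len(s) - 1, int(q * len(s)))
    return s[idx]

# Finding the threshold for the ~10% nearest of 100,000 points sorts
# only 2,000 sampled values rather than all 100,000.
threshold = approx_quantile(list(range(100000)), 0.1, sample_size=2000, seed=1)
```

The estimate is within sampling error of the exact quantile, which is acceptable for an *approximate* nearest-neighbor method and avoids a full shuffle-and-sort of the dataset.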
[jira] [Resolved] (SPARK-18286) Add Scala/Java/Python examples for MinHash and RandomProjection
[ https://issues.apache.org/jira/browse/SPARK-18286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yun Ni resolved SPARK-18286. Resolution: Fixed Fix Version/s: 2.2.0 > Add Scala/Java/Python examples for MinHash and RandomProjection > --- > > Key: SPARK-18286 > URL: https://issues.apache.org/jira/browse/SPARK-18286 > Project: Spark > Issue Type: Improvement > Components: Examples, ML >Reporter: Yanbo Liang >Priority: Minor > Fix For: 2.2.0 > > > Add Scala/Java/Python examples for MinHash and RandomProjection -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19553) Add GroupedData.countApprox()
[ https://issues.apache.org/jira/browse/SPARK-19553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870780#comment-15870780 ] Nicholas Chammas commented on SPARK-19553: -- The utility of 1) would be being able to count items instead of distinct items, unless I misunderstood what you're saying. I would imagine that just counting items (as opposed to distinct items) would be cheaper, in addition to being semantically different. I'll open a PR for 3), unless someone else wants to step in and do that. > Add GroupedData.countApprox() > - > > Key: SPARK-19553 > URL: https://issues.apache.org/jira/browse/SPARK-19553 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > We already have a > [{{pyspark.sql.functions.approx_count_distinct()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.approx_count_distinct] > that can be applied to grouped data, but it seems odd that you can't just > get regular approximate count for grouped data. > I imagine the API would mirror that for > [{{RDD.countApprox()}}|http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.countApprox], > but I'm not sure: > {code} > (df > .groupBy('col1') > .countApprox(timeout=300, confidence=0.95) > .show()) > {code} > Or, if we want to mirror the {{approx_count_distinct()}} function, we can do > that too. I'd want to understand why that function doesn't take a timeout or > confidence parameter, though. Also, what does {{rsd}} mean? It's not > documented. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
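Editor's note: the distinction drawn in the comment above, counting items per group versus counting *distinct* items per group, can be shown in plain Python (this is the semantics only, not Spark code):

```python
from collections import Counter, defaultdict

rows = [("a", 1), ("a", 1), ("a", 2), ("b", 3)]

# Per-group row count: what a hypothetical GroupedData.countApprox()
# would estimate. Duplicates count every time they appear.
counts = Counter(key for key, _ in rows)

# Per-group distinct count: what approx_count_distinct() estimates.
# Duplicate values collapse into one.
distinct = defaultdict(set)
for key, value in rows:
    distinct[key].add(value)
distinct_counts = {k: len(v) for k, v in distinct.items()}

# counts == {"a": 3, "b": 1}; distinct_counts == {"a": 2, "b": 1}
```

The two are semantically different, and as the comment notes, the plain count is also cheaper: it needs no per-group value tracking (here a set; in Spark a HyperLogLog sketch), only a running tally.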
[jira] [Updated] (SPARK-14658) when an executor is lost, the DAGScheduler may submit one stage twice even if the first running taskset for the stage is not finished
[ https://issues.apache.org/jira/browse/SPARK-14658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-14658: --- Description: {code} 16/04/14 15:35:22 ERROR DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting down SparkContext java.lang.IllegalStateException: more than one active taskSet for stage 57: 57.2,57.1 at org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173) at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) {code} First Time: {code} 16/04/14 15:35:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at AccessController.java:-2) because some of its tasks had failed: 5, 8, 9, 12, 13, 16, 17, 18, 19, 23, 26, 27, 28, 29, 30, 31, 40, 42, 43, 48, 49, 50, 51, 52, 53, 55, 56, 57, 59, 60, 61, 67, 70, 71, 84, 85, 86, 87, 98, 99, 100, 101, 108, 109, 110, 111, 112, 113, 114, 115, 126, 127, 134, 136, 137, 146, 147, 150, 151, 154, 155, 158, 159, 162, 163, 164, 165, 166, 167, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 188, 189, 190, 191, 198, 199, 204, 206, 207, 208, 218, 219, 222, 223, 230, 231, 236, 238, 239 16/04/14 15:35:20 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) 16/04/14 15:35:20 DEBUG DAGScheduler: missing: List() 16/04/14 15:35:20 INFO DAGScheduler: Submitting ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2), which 
has no missing parents 16/04/14 15:35:20 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) 16/04/14 15:35:20 INFO DAGScheduler: Submitting 100 missing tasks from ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) 16/04/14 15:35:20 DEBUG DAGScheduler: New pending partitions: Set(206, 177, 127, 98, 48, 27, 23, 163, 238, 188, 159, 28, 109, 59, 9, 176, 126, 207, 174, 43, 170, 208, 158, 108, 29, 8, 204, 154, 223, 173, 219, 190, 111, 61, 40, 136, 115, 86, 57, 155, 55, 230, 222, 180, 172, 151, 101, 18, 166, 56, 137, 87, 52, 171, 71, 42, 167, 198, 67, 17, 236, 165, 13, 5, 53, 178, 99, 70, 49, 218, 147, 164, 114, 85, 60, 31, 179, 150, 19, 100, 50, 175, 146, 134, 113, 84, 51, 30, 199, 26, 16, 191, 162, 112, 12, 239, 231, 189, 181, 110) {code} Second Time: {code} 16/04/14 15:35:22 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at AccessController.java:-2) because some of its tasks had failed: 26 16/04/14 15:35:22 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) 16/04/14 15:35:22 DEBUG DAGScheduler: missing: List() 16/04/14 15:35:22 INFO DAGScheduler: Submitting ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no missing parents 16/04/14 15:35:22 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) 16/04/14 15:35:22 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) 16/04/14 15:35:22 DEBUG DAGScheduler: New pending partitions: Set(26) {code} was: 16/04/14 15:35:22 ERROR DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting down SparkContext java.lang.IllegalStateException: more than one active taskSet for stage 57: 57.2,57.1 at org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173) at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052) at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) First Time: 16/04/14 15:35:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at AccessController.java:-2) because some of its tasks had failed: 5, 8, 9, 12, 13, 16, 17, 18, 19, 23, 26, 27, 28, 29, 30, 31, 40, 42, 43, 48, 49, 50, 51, 52, 53, 55, 56, 57, 59, 60, 61, 67, 70, 71, 84, 85, 86, 87, 98, 99,
[jira] [Commented] (SPARK-14658) when an executor is lost, the DAGScheduler may submit one stage twice even if the first running taskset for the stage is not finished
[ https://issues.apache.org/jira/browse/SPARK-14658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870777#comment-15870777 ] Josh Rosen commented on SPARK-14658: [~srowen], I think that [~yixiaohua] is right here: it looks like SPARK-14649 is proposing a performance optimization to avoid the wasteful resubmission of tasks when failures occur (which is only a perf. optimization, although an important one for the workload that [~sitalke...@gmail.com] describes), whereas this ticket is discussing a correctness issue where one of the scheduler's internal invariants is being violated, causing a total SparkContext shutdown. I'm going to unmark this as a duplicate and will re-open so that someone can review the proposed fix. I also have additional logs from a fresh occurrence of this bug in Spark 2.1.0+, which I'll upload in a followup comment here. > when executor lost DagScheduer may submit one stage twice even if the first > running taskset for this stage is not finished > -- > > Key: SPARK-14658 > URL: https://issues.apache.org/jira/browse/SPARK-14658 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.1, 2.0.0, 2.1.0, 2.2.0 > Environment: spark1.6.1 hadoop-2.6.0-cdh5.4.2 >Reporter: yixiaohua > > 16/04/14 15:35:22 ERROR DAGSchedulerEventProcessLoop: > DAGSchedulerEventProcessLoop failed; shutting down SparkContext > java.lang.IllegalStateException: more than one active taskSet for stage 57: > 57.2,57.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > First Time: > 16/04/14 15:35:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 5, 8, 9, 12, > 13, 16, 17, 18, 19, 23, 26, 27, 28, 29, 30, 31, 40, 42, 43, 48, 49, 50, 51, > 52, 53, 55, 56, 57, 59, 60, 61, 67, 70, 71, 84, 85, 86, 87, 98, 99, 100, 101, > 108, 109, 110, 111, 112, 113, 114, 115, 126, 127, 134, 136, 137, 146, 147, > 150, 151, 154, 155, 158, 159, 162, 163, 164, 165, 166, 167, 170, 171, 172, > 173, 174, 175, 176, 177, 178, 179, 180, 181, 188, 189, 190, 191, 198, 199, > 204, 206, 207, 208, 218, 219, 222, 223, 230, 231, 236, 238, 239 > 16/04/14 15:35:20 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:20 DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:20 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:20 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:20 INFO DAGScheduler: Submitting 100 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:20 DEBUG DAGScheduler: New pending partitions: Set(206, 177, > 127, 98, 48, 27, 23, 163, 238, 188, 159, 28, 109, 59, 9, 176, 126, 207, 174, > 43, 170, 208, 158, 108, 29, 8, 204, 154, 223, 173, 219, 190, 111, 61, 40, > 136, 115, 86, 57, 155, 55, 230, 222, 180, 172, 151, 101, 18, 166, 56, 137, > 87, 52, 171, 71, 42, 167, 198, 67, 17, 236, 165, 13, 5, 53, 178, 99, 70, 49, > 218, 147, 164, 114, 85, 60, 31, 179, 150, 19, 100, 50, 175, 146, 134, 113, > 84, 51, 30, 199, 26, 16, 191, 162, 112, 12, 239, 231, 189, 181, 110) > Second Time: > 16/04/14 15:35:22 INFO DAGScheduler: 
Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 26 > 16/04/14 15:35:22 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:22 DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:22 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:22 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:22 INFO DAGScheduler: Submitting 1 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:22
[jira] [Updated] (SPARK-14658) when an executor is lost, the DAGScheduler may submit one stage twice even if the first running taskset for the stage is not finished
[ https://issues.apache.org/jira/browse/SPARK-14658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-14658: --- Affects Version/s: 2.2.0 2.0.0 2.1.0 > when executor lost DagScheduer may submit one stage twice even if the first > running taskset for this stage is not finished > -- > > Key: SPARK-14658 > URL: https://issues.apache.org/jira/browse/SPARK-14658 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.1, 2.0.0, 2.1.0, 2.2.0 > Environment: spark1.6.1 hadoop-2.6.0-cdh5.4.2 >Reporter: yixiaohua > > 16/04/14 15:35:22 ERROR DAGSchedulerEventProcessLoop: > DAGSchedulerEventProcessLoop failed; shutting down SparkContext > java.lang.IllegalStateException: more than one active taskSet for stage 57: > 57.2,57.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > First Time: > 16/04/14 15:35:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 5, 8, 9, 12, > 13, 16, 17, 18, 19, 23, 26, 27, 28, 29, 30, 31, 40, 42, 43, 48, 49, 50, 51, > 52, 53, 55, 56, 57, 59, 60, 61, 67, 70, 71, 84, 85, 86, 87, 98, 99, 100, 101, > 108, 109, 110, 111, 112, 113, 114, 115, 126, 127, 134, 136, 137, 146, 147, > 150, 151, 154, 155, 158, 159, 
162, 163, 164, 165, 166, 167, 170, 171, 172, > 173, 174, 175, 176, 177, 178, 179, 180, 181, 188, 189, 190, 191, 198, 199, > 204, 206, 207, 208, 218, 219, 222, 223, 230, 231, 236, 238, 239 > 16/04/14 15:35:20 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:20 DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:20 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:20 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:20 INFO DAGScheduler: Submitting 100 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:20 DEBUG DAGScheduler: New pending partitions: Set(206, 177, > 127, 98, 48, 27, 23, 163, 238, 188, 159, 28, 109, 59, 9, 176, 126, 207, 174, > 43, 170, 208, 158, 108, 29, 8, 204, 154, 223, 173, 219, 190, 111, 61, 40, > 136, 115, 86, 57, 155, 55, 230, 222, 180, 172, 151, 101, 18, 166, 56, 137, > 87, 52, 171, 71, 42, 167, 198, 67, 17, 236, 165, 13, 5, 53, 178, 99, 70, 49, > 218, 147, 164, 114, 85, 60, 31, 179, 150, 19, 100, 50, 175, 146, 134, 113, > 84, 51, 30, 199, 26, 16, 191, 162, 112, 12, 239, 231, 189, 181, 110) > Second Time: > 16/04/14 15:35:22 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 26 > 16/04/14 15:35:22 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:22 DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:22 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:22 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:22 INFO DAGScheduler: Submitting 1 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:22 DEBUG DAGScheduler: New pending partitions: 
Set(26)
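Editor's note: the invariant the stack trace shows being violated ("more than one active taskSet for stage 57: 57.2,57.1") can be sketched as a toy check. Names here are hypothetical; Spark's TaskSchedulerImpl is far more involved:

```python
class ToyTaskScheduler:
    """Mimics the invariant enforced in TaskSchedulerImpl.submitTasks:
    at most one TaskSet attempt may be active per stage at a time."""

    def __init__(self):
        self.active = {}  # stage_id -> list of active attempt numbers

    def submit_tasks(self, stage_id, attempt):
        attempts = self.active.setdefault(stage_id, [])
        if attempts:
            # A previous attempt for this stage is still running: this is
            # the condition the bug report hits, crashing the SparkContext.
            raise RuntimeError(
                "more than one active taskSet for stage %d: %s" % (
                    stage_id,
                    ",".join("%d.%d" % (stage_id, a)
                             for a in attempts + [attempt])))
        attempts.append(attempt)

    def taskset_finished(self, stage_id, attempt):
        self.active[stage_id].remove(attempt)

sched = ToyTaskScheduler()
sched.submit_tasks(57, 1)  # the first resubmission is accepted
# Submitting attempt 2 before attempt 1 finishes would raise, which is
# the toy analogue of the IllegalStateException in the logs above.
```

The bug, then, is not in this check but in the DAGScheduler resubmitting the stage (attempt 57.2) while attempt 57.1 had not yet been marked finished.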
[jira] [Updated] (SPARK-14658) when an executor is lost, the DAGScheduler may submit one stage twice even if the first running taskset for the stage is not finished
[ https://issues.apache.org/jira/browse/SPARK-14658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-14658: --- Component/s: (was: Spark Core) Scheduler > when executor lost DagScheduer may submit one stage twice even if the first > running taskset for this stage is not finished > -- > > Key: SPARK-14658 > URL: https://issues.apache.org/jira/browse/SPARK-14658 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.1, 2.0.0, 2.1.0, 2.2.0 > Environment: spark1.6.1 hadoop-2.6.0-cdh5.4.2 >Reporter: yixiaohua > > 16/04/14 15:35:22 ERROR DAGSchedulerEventProcessLoop: > DAGSchedulerEventProcessLoop failed; shutting down SparkContext > java.lang.IllegalStateException: more than one active taskSet for stage 57: > 57.2,57.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > First Time: > 16/04/14 15:35:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 5, 8, 9, 12, > 13, 16, 17, 18, 19, 23, 26, 27, 28, 29, 30, 31, 40, 42, 43, 48, 49, 50, 51, > 52, 53, 55, 56, 57, 59, 60, 61, 67, 70, 71, 84, 85, 86, 87, 98, 99, 100, 101, > 108, 109, 110, 111, 112, 113, 114, 115, 126, 127, 134, 136, 137, 146, 147, > 150, 151, 154, 155, 158, 159, 
162, 163, 164, 165, 166, 167, 170, 171, 172, > 173, 174, 175, 176, 177, 178, 179, 180, 181, 188, 189, 190, 191, 198, 199, > 204, 206, 207, 208, 218, 219, 222, 223, 230, 231, 236, 238, 239 > 16/04/14 15:35:20 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:20 DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:20 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:20 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:20 INFO DAGScheduler: Submitting 100 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:20 DEBUG DAGScheduler: New pending partitions: Set(206, 177, > 127, 98, 48, 27, 23, 163, 238, 188, 159, 28, 109, 59, 9, 176, 126, 207, 174, > 43, 170, 208, 158, 108, 29, 8, 204, 154, 223, 173, 219, 190, 111, 61, 40, > 136, 115, 86, 57, 155, 55, 230, 222, 180, 172, 151, 101, 18, 166, 56, 137, > 87, 52, 171, 71, 42, 167, 198, 67, 17, 236, 165, 13, 5, 53, 178, 99, 70, 49, > 218, 147, 164, 114, 85, 60, 31, 179, 150, 19, 100, 50, 175, 146, 134, 113, > 84, 51, 30, 199, 26, 16, 191, 162, 112, 12, 239, 231, 189, 181, 110) > Second Time: > 16/04/14 15:35:22 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 26 > 16/04/14 15:35:22 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:22 DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:22 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:22 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:22 INFO DAGScheduler: Submitting 1 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:22 DEBUG DAGScheduler: New pending partitions: 
Set(26)
[jira] [Reopened] (SPARK-14658) when an executor is lost, the DAGScheduler may submit one stage twice even if the first running taskset for the stage is not finished
[ https://issues.apache.org/jira/browse/SPARK-14658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reopened SPARK-14658: > when executor lost DagScheduer may submit one stage twice even if the first > running taskset for this stage is not finished > -- > > Key: SPARK-14658 > URL: https://issues.apache.org/jira/browse/SPARK-14658 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.1, 2.0.0, 2.1.0, 2.2.0 > Environment: spark1.6.1 hadoop-2.6.0-cdh5.4.2 >Reporter: yixiaohua > > 16/04/14 15:35:22 ERROR DAGSchedulerEventProcessLoop: > DAGSchedulerEventProcessLoop failed; shutting down SparkContext > java.lang.IllegalStateException: more than one active taskSet for stage 57: > 57.2,57.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > First Time: > 16/04/14 15:35:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 5, 8, 9, 12, > 13, 16, 17, 18, 19, 23, 26, 27, 28, 29, 30, 31, 40, 42, 43, 48, 49, 50, 51, > 52, 53, 55, 56, 57, 59, 60, 61, 67, 70, 71, 84, 85, 86, 87, 98, 99, 100, 101, > 108, 109, 110, 111, 112, 113, 114, 115, 126, 127, 134, 136, 137, 146, 147, > 150, 151, 154, 155, 158, 159, 162, 163, 164, 165, 166, 167, 170, 171, 
172, > 173, 174, 175, 176, 177, 178, 179, 180, 181, 188, 189, 190, 191, 198, 199, > 204, 206, 207, 208, 218, 219, 222, 223, 230, 231, 236, 238, 239 > 16/04/14 15:35:20 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:20 DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:20 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:20 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:20 INFO DAGScheduler: Submitting 100 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:20 DEBUG DAGScheduler: New pending partitions: Set(206, 177, > 127, 98, 48, 27, 23, 163, 238, 188, 159, 28, 109, 59, 9, 176, 126, 207, 174, > 43, 170, 208, 158, 108, 29, 8, 204, 154, 223, 173, 219, 190, 111, 61, 40, > 136, 115, 86, 57, 155, 55, 230, 222, 180, 172, 151, 101, 18, 166, 56, 137, > 87, 52, 171, 71, 42, 167, 198, 67, 17, 236, 165, 13, 5, 53, 178, 99, 70, 49, > 218, 147, 164, 114, 85, 60, 31, 179, 150, 19, 100, 50, 175, 146, 134, 113, > 84, 51, 30, 199, 26, 16, 191, 162, 112, 12, 239, 231, 189, 181, 110) > Second Time: > 16/04/14 15:35:22 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 26 > 16/04/14 15:35:22 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:22 DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:22 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:22 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:22 INFO DAGScheduler: Submitting 1 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:22 DEBUG DAGScheduler: New pending partitions: Set(26) -- This message was sent by 
Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19628) Duplicate Spark jobs in 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-19628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19628: -- Fix Version/s: (was: 2.0.1) > Duplicate Spark jobs in 2.1.0 > - > > Key: SPARK-19628 > URL: https://issues.apache.org/jira/browse/SPARK-19628 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jork Zijlstra > Attachments: spark2.0.1.png, spark2.1.0-examplecode.png, > spark2.1.0.png > > After upgrading to Spark 2.1.0 we noticed that duplicate jobs are > executed. Going back to Spark 2.0.1, they are gone again. > {code} > import org.apache.spark.sql._ > object DoubleJobs { > def main(args: Array[String]) { > System.setProperty("hadoop.home.dir", "/tmp"); > val sparkSession: SparkSession = SparkSession.builder > .master("local[4]") > .appName("spark session example") > .config("spark.driver.maxResultSize", "6G") > .config("spark.sql.orc.filterPushdown", true) > .config("spark.sql.hive.metastorePartitionPruning", true) > .getOrCreate() > sparkSession.sqlContext.setConf("spark.sql.orc.filterPushdown", "true") > val paths = Seq( > ""//some orc source > ) > def dataFrame(path: String): DataFrame = { > sparkSession.read.orc(path) > } > paths.foreach(path => { > dataFrame(path).show(20) > }) > } > } > {code}
[jira] [Assigned] (SPARK-18450) Add AND-amplification to Locality Sensitive Hashing
[ https://issues.apache.org/jira/browse/SPARK-18450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18450: Assignee: (was: Apache Spark) > Add AND-amplification to Locality Sensitive Hashing > --- > > Key: SPARK-18450 > URL: https://issues.apache.org/jira/browse/SPARK-18450 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yun Ni > > We are changing the LSH transform API from {{Vector}} to {{Array of Vector}}. > Once the change is applied, we can add AND-amplification to LSH > implementation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18450) Add AND-amplification to Locality Sensitive Hashing
[ https://issues.apache.org/jira/browse/SPARK-18450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870727#comment-15870727 ] Apache Spark commented on SPARK-18450: -- User 'Yunni' has created a pull request for this issue: https://github.com/apache/spark/pull/16965 > Add AND-amplification to Locality Sensitive Hashing > --- > > Key: SPARK-18450 > URL: https://issues.apache.org/jira/browse/SPARK-18450 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yun Ni > > We are changing the LSH transform API from {{Vector}} to {{Array of Vector}}. > Once the change is applied, we can add AND-amplification to LSH > implementation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
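Editor's note: AND-amplification itself is standard LSH machinery, which is why it needs the Array-of-Vector output described above. A minimal sketch, independent of Spark's API (random-projection sign hashes; all names illustrative):

```python
import random

def make_hashes(num_bands, rows_per_band, dim, seed=0):
    """Random hyperplanes grouped into bands: the hashes inside one band
    are ANDed together, and the bands are ORed across."""
    rng = random.Random(seed)
    return [[[rng.gauss(0, 1) for _ in range(dim)]
             for _ in range(rows_per_band)]
            for _ in range(num_bands)]

def signature(bands, x):
    # One tuple of sign bits per band -- the "Array of Vector" shape.
    return [tuple(int(sum(w * xi for w, xi in zip(h, x)) >= 0) for h in band)
            for band in bands]

def is_candidate(bands, x, y):
    """x and y are candidate neighbors if, in at least one band (OR),
    every hash in that band agrees (AND-amplification)."""
    sx, sy = signature(bands, x), signature(bands, y)
    return any(bx == by for bx, by in zip(sx, sy))
```

Raising `rows_per_band` makes each band stricter (fewer false positives); raising `num_bands` compensates on recall. With a single hash per output value, as in the pre-change API, only the OR side is expressible.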
[jira] [Assigned] (SPARK-19163) Lazy creation of the _judf
[ https://issues.apache.org/jira/browse/SPARK-19163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-19163: - Assignee: Maciej Szymkiewicz > Lazy creation of the _judf > -- > > Key: SPARK-19163 > URL: https://issues.apache.org/jira/browse/SPARK-19163 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.1 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz > Fix For: 2.2.0 > > > Current state > Right {{UserDefinedFunction}} eagerly creates {{_judf}} and initializes > {{SparkSession}} > (https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L1832) > as a side effect. This behavior may have undesired results when {{udf}} is > imported from a module: > {{myudfs.py}} > {code} > from pyspark.sql.functions import udf > from pyspark.sql.types import IntegerType > > def _add_one(x): > """Adds one""" > if x is not None: > return x + 1 > > add_one = udf(_add_one, IntegerType()) > {code} > > > Example session: > {code} > In [1]: from pyspark.sql import SparkSession > In [2]: from myudfs import add_one > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/01/07 19:55:44 WARN Utils: Your hostname, xxx resolves to a loopback > address: 127.0.1.1; using xxx instead (on interface eth0) > 17/01/07 19:55:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > In [3]: spark = SparkSession.builder.appName("foo").getOrCreate() > In [4]: spark.sparkContext.appName > Out[4]: 'pyspark-shell' > {code} > Proposed > Delay {{_judf}} initialization until the first call. > {code} > In [1]: from pyspark.sql import SparkSession > In [2]: from myudfs import add_one > In [3]: spark = SparkSession.builder.appName("foo").getOrCreate() > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". 
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/01/07 19:58:38 WARN Utils: Your hostname, xxx resolves to a loopback > address: 127.0.1.1; using xxx instead (on interface eth0) > 17/01/07 19:58:38 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > In [4]: spark.sparkContext.appName > Out[4]: 'foo' > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
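The lazy-initialization pattern proposed in this ticket can be sketched outside of Spark in plain Python. `FakeSession` below is a hypothetical stand-in for `SparkSession`; in PySpark the actual fix is to have `UserDefinedFunction` build `_judf` on first use instead of in its constructor.

```python
# Minimal sketch of the proposed fix: defer expensive session-dependent state
# until the wrapped function is actually called, so importing a module full of
# UDFs no longer creates a session as a side effect.

class FakeSession:
    created = 0  # counts how many "sessions" have been initialized

    def __init__(self):
        FakeSession.created += 1


class LazyUDF:
    def __init__(self, func):
        self.func = func
        self._judf = None  # not created yet: import time is side-effect free

    @property
    def judf(self):
        # Initialize on first access, mirroring the proposed _judf behavior.
        if self._judf is None:
            self._judf = FakeSession()
        return self._judf

    def __call__(self, x):
        self.judf  # session state is only touched when the UDF is used
        return self.func(x)


add_one = LazyUDF(lambda x: x + 1)  # "import time": no session created
assert FakeSession.created == 0
print(add_one(41))                  # first call triggers initialization; prints 42
assert FakeSession.created == 1
```

With this shape, the user-controlled `getOrCreate()` in the example session runs before any UDF call, so the application name "foo" wins.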
[jira] [Assigned] (SPARK-18450) Add AND-amplification to Locality Sensitive Hashing
[ https://issues.apache.org/jira/browse/SPARK-18450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18450: Assignee: Apache Spark > Add AND-amplification to Locality Sensitive Hashing > --- > > Key: SPARK-18450 > URL: https://issues.apache.org/jira/browse/SPARK-18450 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yun Ni >Assignee: Apache Spark > > We are changing the LSH transform API from {{Vector}} to {{Array of Vector}}. > Once the change is applied, we can add AND-amplification to LSH > implementation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
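For readers unfamiliar with the term, AND-amplification means a candidate pair is kept only if *all* hash functions agree (cutting false positives), while OR-amplification keeps a pair if *any* agrees (cutting false negatives). A toy sketch, not Spark's ML API; the fixed sign-projection planes below are illustrative (a real LSH implementation draws them at random):

```python
# Sign-of-projection hashing with four fixed planes; AND vs OR amplification
# over the resulting signatures.

def sign_hash(plane, v):
    # 1-bit hash: which side of the hyperplane the vector falls on.
    return sum(p * x for p, x in zip(plane, v)) >= 0

planes = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0],
          [1.0, 1.0, 1.0]]

def signatures(v):
    return [sign_hash(p, v) for p in planes]

def and_match(a, b):
    # AND-amplification: every hash function must agree (fewer false positives).
    return all(x == y for x, y in zip(signatures(a), signatures(b)))

def or_match(a, b):
    # OR-amplification: any single agreement counts (fewer false negatives).
    return any(x == y for x, y in zip(signatures(a), signatures(b)))

close = ([1.0, 0.9, 1.1], [1.0, 1.0, 1.0])
far = ([1.0, 0.0, 0.0], [-1.0, 0.2, 0.1])
print(and_match(*close), and_match(*far))  # True False
```

Moving the transform output from a single {{Vector}} to an {{Array of Vector}} is what lets each element hold one hash function's output, so AND/OR combinations like the above become expressible.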
[jira] [Assigned] (SPARK-19586) Incorrect push down filter for double negative in SQL
[ https://issues.apache.org/jira/browse/SPARK-19586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-19586: - Assignee: Xiao Li > Incorrect push down filter for double negative in SQL > - > > Key: SPARK-19586 > URL: https://issues.apache.org/jira/browse/SPARK-19586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Everett Anderson >Assignee: Xiao Li > Fix For: 2.0.3 > > > Opening this as it's a somewhat serious issue in the 2.0.x tree in case > there's a 2.0.3 planned, but it is fixed in 2.1.0. > While it works in 1.6.2 and 2.1.0, it appears 2.0.2 has a significant filter > optimization error. > Example: > {noformat} > // Create some fake data > import org.apache.spark.sql.Row > import org.apache.spark.sql.Dataset > import org.apache.spark.sql.types._ > val rowsRDD = sc.parallelize(Seq( > Row(1, "fred"), > Row(2, "amy"), > Row(3, null))) > val schema = StructType(Seq( > StructField("id", IntegerType, nullable = true), > StructField("username", StringType, nullable = true))) > > val data = sqlContext.createDataFrame(rowsRDD, schema) > val path = "/tmp/test_data" > data.write.mode("overwrite").parquet(path) > val testData = sqlContext.read.parquet(path) > testData.registerTempTable("filter_test_table") > {noformat} > {noformat} > %sql > explain select count(*) from filter_test_table where not( username is not > null) > {noformat} > or > {noformat} > spark.sql("select count(*) from filter_test_table where not( username is not > null)").explain > {noformat} > In 2.0.2, I'm seeing > {noformat} > == Physical Plan == > *HashAggregate(keys=[], functions=[count(1)]) > +- Exchange SinglePartition > +- *HashAggregate(keys=[], functions=[partial_count(1)]) > +- *Project > +- *Filter (isnotnull(username#35) && NOT isnotnull(username#35)) > +- *BatchedScan parquet default.[username#35] Format: > ParquetFormat, InputPaths: , PartitionFilters: [], > PushedFilters: [IsNotNull(username), 
Not(IsNotNull(username))], ReadSchema: > struct > {noformat} > which seems like both an impossible Filter and an impossible pushed filter. > In Spark 1.6.2 it was > {noformat} > == Physical Plan == > TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], > output=[_c0#1822L]) > +- TungstenExchange SinglePartition, None > +- TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#1825L]) > +- Project > +- Filter NOT isnotnull(username#1590) > +- Scan ParquetRelation[username#1590] InputPaths: , > PushedFilters: [Not(IsNotNull(username))] > {noformat} > and 2.1.0 it's working again as > {noformat} > == Physical Plan == > *HashAggregate(keys=[], functions=[count(1)]) > +- Exchange SinglePartition >+- *HashAggregate(keys=[], functions=[partial_count(1)]) > +- *Project > +- *Filter NOT isnotnull(username#14) > +- *FileScan parquet [username#14] Batched: true, Format: > Parquet, Location: InMemoryFileIndex[file:/tmp/test_table], PartitionFilters: > [], PushedFilters: [Not(IsNotNull(username))], ReadSchema: > struct > {noformat} > while it's easy for humans in interactive cases to work around this by > removing the double negative, it's a bit harder if it's a programmatic > construct. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16043) Prepare GenericArrayData implementation specialized for a primitive array
[ https://issues.apache.org/jira/browse/SPARK-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-16043: - Assignee: Kazuaki Ishizaki > Prepare GenericArrayData implementation specialized for a primitive array > - > > Key: SPARK-16043 > URL: https://issues.apache.org/jira/browse/SPARK-16043 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki > Fix For: 2.2.0 > > > There is a ToDo of GenericArrayData class, which is to eliminate > boxing/unboxing for a primitive array (described > [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L31]) > It would be good to prepare GenericArrayData implementation specialized for a > primitive array to eliminate boxing/unboxing from the view of runtime memory > footprint and performance. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
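The JVM boxing overhead this ticket targets is not directly visible from Python, but the same memory argument can be demonstrated with the standard-library `array` module: a specialized primitive container stores raw 8-byte doubles, while a generic list holds one heap object per element. This is an analogy only, not Spark code.

```python
import sys
from array import array

n = 10_000
boxed = [float(i) for i in range(n)]  # list: one pointer + one float object per element
unboxed = array("d", range(n))        # contiguous unboxed 8-byte doubles

# The list's own size only counts pointers; each float object adds its own
# ~24-byte header on top, which is the "boxing" cost.
per_element_boxed = sys.getsizeof(boxed) / n + sys.getsizeof(1.0)
per_element_unboxed = sys.getsizeof(unboxed) / n

print(round(per_element_boxed), round(per_element_unboxed))
assert per_element_unboxed < per_element_boxed
```

A `GenericArrayData` specialized for primitives would get the JVM equivalent of the `array("d", ...)` layout: smaller footprint and no per-access box/unbox.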
[jira] [Created] (SPARK-19636) Feature parity for correlation statistics in MLlib
Timothy Hunter created SPARK-19636: -- Summary: Feature parity for correlation statistics in MLlib Key: SPARK-19636 URL: https://issues.apache.org/jira/browse/SPARK-19636 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.1.0 Reporter: Timothy Hunter This ticket tracks porting the functionality of spark.mllib.Statistics.corr() over to spark.ml. Here is a design doc: https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib
Timothy Hunter created SPARK-19635: -- Summary: Feature parity for Chi-square hypothesis testing in MLlib Key: SPARK-19635 URL: https://issues.apache.org/jira/browse/SPARK-19635 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.1.0 Reporter: Timothy Hunter This ticket tracks porting the functionality of spark.mllib.Statistics.chiSqTest over to spark.ml. Here is a design doc: https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19634) Feature parity for descriptive statistics in MLlib
Timothy Hunter created SPARK-19634: -- Summary: Feature parity for descriptive statistics in MLlib Key: SPARK-19634 URL: https://issues.apache.org/jira/browse/SPARK-19634 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.1.0 Reporter: Timothy Hunter This ticket tracks porting the functionality of spark.mllib.MultivariateOnlineSummarizer over to spark.ml. A design has been discussed in SPARK-19208 . Here is a design doc: https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19557) Output parameters are not present in SQL Query Plan
[ https://issues.apache.org/jira/browse/SPARK-19557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870664#comment-15870664 ] Apache Spark commented on SPARK-19557: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/16962 > Output parameters are not present in SQL Query Plan > --- > > Key: SPARK-19557 > URL: https://issues.apache.org/jira/browse/SPARK-19557 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Salil Surendran > > For DataFrameWriter methods like parquet(), json(), csv() etc. output > parameters are not present in the QueryExecution object. For methods like > saveAsTable() they do. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19557) Output parameters are not present in SQL Query Plan
[ https://issues.apache.org/jira/browse/SPARK-19557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19557: Assignee: Apache Spark > Output parameters are not present in SQL Query Plan > --- > > Key: SPARK-19557 > URL: https://issues.apache.org/jira/browse/SPARK-19557 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Salil Surendran >Assignee: Apache Spark > > For DataFrameWriter methods like parquet(), json(), csv() etc. output > parameters are not present in the QueryExecution object. For methods like > saveAsTable() they do. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19557) Output parameters are not present in SQL Query Plan
[ https://issues.apache.org/jira/browse/SPARK-19557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19557: Assignee: (was: Apache Spark) > Output parameters are not present in SQL Query Plan > --- > > Key: SPARK-19557 > URL: https://issues.apache.org/jira/browse/SPARK-19557 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Salil Surendran > > For DataFrameWriter methods like parquet(), json(), csv() etc. output > parameters are not present in the QueryExecution object. For methods like > saveAsTable() they do. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870655#comment-15870655 ] Timothy Hunter commented on SPARK-19208: I put together the ideas in this thread into a document. I will update the umbrella ticket with sub-tasks once folks have had a chance to comment: https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However, {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task; moreover, it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748,401 instances and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After the modification in the PR, the above example runs successfully. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
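The core idea of the optimization above, maintaining only the statistics a caller actually needs, can be sketched in plain Python. The function name is illustrative; the real change lives in Spark's Scala summarizer.

```python
# One pass, one array: enough for MaxAbsScaler-style scaling, instead of the
# 8 arrays a full multivariate summarizer allocates per feature vector.

def max_abs(rows):
    """Column-wise max of absolute values over an iterable of rows."""
    max_abs_vals = None
    for row in rows:
        if max_abs_vals is None:
            max_abs_vals = [abs(x) for x in row]
        else:
            for j, x in enumerate(row):
                if abs(x) > max_abs_vals[j]:
                    max_abs_vals[j] = abs(x)
    return max_abs_vals

rows = [[1.0, -5.0, 2.0],
        [-3.0, 4.0, 0.5]]
print(max_abs(rows))  # [3.0, 5.0, 2.0]
```

With 29,890,095 features, this is one `Double` array of that length instead of eight, which is exactly why the OOM in the example disappears.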
[jira] [Closed] (SPARK-19632) Allow configuring non-hive and non-local SessionState and ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-19632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kruszewski closed SPARK-19632. - Resolution: Won't Fix > Allow configuring non-hive and non-local SessionState and ExternalCatalog > - > > Key: SPARK-19632 > URL: https://issues.apache.org/jira/browse/SPARK-19632 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Robert Kruszewski > > I want to be able to integrate non hive catalogs into spark. It seems api is > pretty close already. This issue proposes opening up existing interfaces for > external implementation. > For ExternalCatalog there seems to be already an abstract class and > SessionState java doc suggests it should stay a scala class to avoid > initialization order -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19632) Allow configuring non-hive and non-local SessionState and ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-19632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870649#comment-15870649 ] Robert Kruszewski commented on SPARK-19632: --- Thanks, I must have missed that. > Allow configuring non-hive and non-local SessionState and ExternalCatalog > - > > Key: SPARK-19632 > URL: https://issues.apache.org/jira/browse/SPARK-19632 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Robert Kruszewski > > I want to be able to integrate non hive catalogs into spark. It seems api is > pretty close already. This issue proposes opening up existing interfaces for > external implementation. > For ExternalCatalog there seems to be already an abstract class and > SessionState java doc suggests it should stay a scala class to avoid > initialization order -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19534) Convert Java tests to use lambdas, Java 8 features
[ https://issues.apache.org/jira/browse/SPARK-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870638#comment-15870638 ] Apache Spark commented on SPARK-19534: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/16964 > Convert Java tests to use lambdas, Java 8 features > -- > > Key: SPARK-19534 > URL: https://issues.apache.org/jira/browse/SPARK-19534 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 2.2.0 >Reporter: Sean Owen >Assignee: Sean Owen > > Likewise, Java tests can be simplified by use of Java 8 lambdas. This is a > significant sub-task in its own right. This shouldn't mean that 'old' APIs go > untested because there are no separate Java 8 APIs; it's just syntactic sugar > for calls to the same APIs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19632) Allow configuring non-hive and non-local SessionState and ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-19632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870630#comment-15870630 ] Dongjoon Hyun commented on SPARK-19632: --- Hi, [~robert3005]. I just added a previous JIRA issue link for this issue. > Allow configuring non-hive and non-local SessionState and ExternalCatalog > - > > Key: SPARK-19632 > URL: https://issues.apache.org/jira/browse/SPARK-19632 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Robert Kruszewski > > I want to be able to integrate non hive catalogs into spark. It seems api is > pretty close already. This issue proposes opening up existing interfaces for > external implementation. > For ExternalCatalog there seems to be already an abstract class and > SessionState java doc suggests it should stay a scala class to avoid > initialization order -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19633) FileSource read from FileSink
Michael Armbrust created SPARK-19633: Summary: FileSource read from FileSink Key: SPARK-19633 URL: https://issues.apache.org/jira/browse/SPARK-19633 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Michael Armbrust Priority: Critical Right now, you can't start a streaming query from a location that is being written to by the file sink. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19632) Allow configuring non-hive and non-local SessionState and ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-19632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19632: Assignee: Apache Spark > Allow configuring non-hive and non-local SessionState and ExternalCatalog > - > > Key: SPARK-19632 > URL: https://issues.apache.org/jira/browse/SPARK-19632 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Robert Kruszewski >Assignee: Apache Spark > > I want to be able to integrate non hive catalogs into spark. It seems api is > pretty close already. This issue proposes opening up existing interfaces for > external implementation. > For ExternalCatalog there seems to be already an abstract class and > SessionState java doc suggests it should stay a scala class to avoid > initialization order -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19632) Allow configuring non-hive and non-local SessionState and ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-19632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19632: Assignee: (was: Apache Spark) > Allow configuring non-hive and non-local SessionState and ExternalCatalog > - > > Key: SPARK-19632 > URL: https://issues.apache.org/jira/browse/SPARK-19632 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Robert Kruszewski > > I want to be able to integrate non hive catalogs into spark. It seems api is > pretty close already. This issue proposes opening up existing interfaces for > external implementation. > For ExternalCatalog there seems to be already an abstract class and > SessionState java doc suggests it should stay a scala class to avoid > initialization order -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19632) Allow configuring non-hive and non-local SessionState and ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-19632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870617#comment-15870617 ] Apache Spark commented on SPARK-19632: -- User 'robert3005' has created a pull request for this issue: https://github.com/apache/spark/pull/16963 > Allow configuring non-hive and non-local SessionState and ExternalCatalog > - > > Key: SPARK-19632 > URL: https://issues.apache.org/jira/browse/SPARK-19632 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Robert Kruszewski > > I want to be able to integrate non hive catalogs into spark. It seems api is > pretty close already. This issue proposes opening up existing interfaces for > external implementation. > For ExternalCatalog there seems to be already an abstract class and > SessionState java doc suggests it should stay a scala class to avoid > initialization order -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17302) Cannot set non-Spark SQL session variables in hive-site.xml, spark-defaults.conf, or using --conf
[ https://issues.apache.org/jira/browse/SPARK-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870616#comment-15870616 ] Abhishek Madav commented on SPARK-17302: I believe this is fixed as part of SPARK-15887. Could you check? > Cannot set non-Spark SQL session variables in hive-site.xml, > spark-defaults.conf, or using --conf > - > > Key: SPARK-17302 > URL: https://issues.apache.org/jira/browse/SPARK-17302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ryan Blue > > When configuration changed for 2.0 to the new SparkSession structure, Spark > stopped using Hive's internal HiveConf for session state and now uses > HiveSessionState and an associated SQLConf. Now, session options like > hive.exec.compress.output and hive.exec.dynamic.partition.mode are pulled > from this SQLConf. This doesn't include session properties from hive-site.xml > (including hive.exec.compress.output), and no longer contains Spark-specific > overrides from spark-defaults.conf that used the spark.hadoop.hive... pattern. > Also, setting these variables on the command-line no longer works because > settings must start with "spark.". > Is there a recommended way to set Hive session properties? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
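The {{spark.hadoop.hive...}} pattern mentioned in the report is a prefix convention: Spark copies such entries into the Hadoop/Hive configuration with the prefix stripped. A minimal sketch of that convention (illustrative code, not Spark's implementation):

```python
# Entries under the "spark.hadoop." namespace are forwarded to the
# Hadoop/Hive configuration with the prefix removed.

def extract_hadoop_conf(spark_conf):
    prefix = "spark.hadoop."
    return {k[len(prefix):]: v
            for k, v in spark_conf.items()
            if k.startswith(prefix)}

conf = {
    "spark.master": "local[4]",
    "spark.hadoop.hive.exec.compress.output": "true",
    "spark.hadoop.hive.exec.dynamic.partition.mode": "nonstrict",
}
print(extract_hadoop_conf(conf))
```

This also shows why plain `--conf hive.exec.compress.output=true` is rejected: without the `spark.` prefix the key never enters Spark's configuration in the first place.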
[jira] [Updated] (SPARK-19628) Duplicate Spark jobs in 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-19628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19628: - Component/s: (was: Spark Core) SQL > Duplicate Spark jobs in 2.1.0 > - > > Key: SPARK-19628 > URL: https://issues.apache.org/jira/browse/SPARK-19628 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jork Zijlstra > Fix For: 2.0.1 > > Attachments: spark2.0.1.png, spark2.1.0-examplecode.png, > spark2.1.0.png > > > After upgrading to Spark 2.1.0, we noticed that duplicate jobs are executed. > Going back to Spark 2.0.1, they are gone again. > {code} > import org.apache.spark.sql._ > object DoubleJobs { > def main(args: Array[String]) { > System.setProperty("hadoop.home.dir", "/tmp"); > val sparkSession: SparkSession = SparkSession.builder > .master("local[4]") > .appName("spark session example") > .config("spark.driver.maxResultSize", "6G") > .config("spark.sql.orc.filterPushdown", true) > .config("spark.sql.hive.metastorePartitionPruning", true) > .getOrCreate() > sparkSession.sqlContext.setConf("spark.sql.orc.filterPushdown", "true") > val paths = Seq( > ""//some orc source > ) > def dataFrame(path: String): DataFrame = { > sparkSession.read.orc(path) > } > paths.foreach(path => { > dataFrame(path).show(20) > }) > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19632) Allow configuring non-hive and non-local SessionState and ExternalCatalog
Robert Kruszewski created SPARK-19632: - Summary: Allow configuring non-hive and non-local SessionState and ExternalCatalog Key: SPARK-19632 URL: https://issues.apache.org/jira/browse/SPARK-19632 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Robert Kruszewski I want to be able to integrate non-Hive catalogs into Spark, and the existing API already seems close to allowing it. This issue proposes opening up the existing interfaces for external implementation. For ExternalCatalog there is already an abstract class, and the SessionState Javadoc suggests it should stay a Scala class to avoid initialization-order issues. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18120) QueryExecutionListener method doesn't get executed for DataFrameWriter methods
[ https://issues.apache.org/jira/browse/SPARK-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870586#comment-15870586 ] Apache Spark commented on SPARK-18120: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/16962 > QueryExecutionListener method doesnt' get executed for DataFrameWriter methods > -- > > Key: SPARK-18120 > URL: https://issues.apache.org/jira/browse/SPARK-18120 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Salil Surendran > > QueryExecutionListener is a class that has methods named onSuccess() and > onFailure() that gets called when a query is executed. Each of those methods > takes a QueryExecution object as a parameter which can be used for metrics > analysis. It gets called for several of the DataSet methods like take, head, > first, collect etc. but doesn't get called for any of the DataFrameWriter > methods like saveAsTable, save etc. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
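The listener contract described above can be sketched in plain Python; the `write` method below models the reported gap, where a writer-style code path runs the work but never notifies listeners. All names are illustrative, not Spark's API.

```python
# Observer pattern with success/failure callbacks, and one code path that
# bypasses the notification step (the bug this ticket reports).

class Listener:
    def __init__(self):
        self.events = []

    def on_success(self, name, result):
        self.events.append(("success", name))

    def on_failure(self, name, exc):
        self.events.append(("failure", name))


class Engine:
    def __init__(self):
        self.listeners = []

    def run(self, name, action):
        # Analogous to Dataset actions (collect, take, ...): listeners fire.
        try:
            result = action()
        except Exception as exc:
            for listener in self.listeners:
                listener.on_failure(name, exc)
            raise
        for listener in self.listeners:
            listener.on_success(name, result)
        return result

    def write(self, name, action):
        # Analogous to the buggy writer path: runs the action but never
        # notifies listeners, so metrics for writes are silently lost.
        return action()


engine = Engine()
listener = Listener()
engine.listeners.append(listener)

engine.run("collect", lambda: [1, 2, 3])
engine.write("save", lambda: None)
print(listener.events)  # [('success', 'collect')]; the write left no trace
```

The fix is to route writer-style operations through the same notifying path as actions.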
[jira] [Commented] (SPARK-18891) Support for specific collection types
[ https://issues.apache.org/jira/browse/SPARK-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870570#comment-15870570 ] Michal Šenkýř commented on SPARK-18891: --- Started my work on Map support, as there is still no progress on the optimization PR for Seq. I chose the builder approach as it seemed more straightforward. I will keep the Map implementation separate from the Seq one so that it works even if the Seq PR doesn't get approved. > Support for specific collection types > - > > Key: SPARK-18891 > URL: https://issues.apache.org/jira/browse/SPARK-18891 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.3, 2.1.0 >Reporter: Michael Armbrust >Priority: Critical > > Encoders treat all collections the same (i.e. {{Seq}} vs {{List}}), which > forces users to only define classes with the most generic type. > An [example > error|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/2398463439880241/2840265927289860/latest.html]: > {code} > case class SpecificCollection(aList: List[Int]) > Seq(SpecificCollection(1 :: Nil)).toDS().collect() > {code} > {code} > java.lang.RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 98, Column 120: No applicable constructor/method found > for actual parameters "scala.collection.Seq"; candidates are: > "line29e7e4b1e36445baa3505b2e102aa86b29.$read$$iw$$iw$$iw$$iw$SpecificCollection(scala.collection.immutable.List)" > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19614) add type-preserving null function
[ https://issues.apache.org/jira/browse/SPARK-19614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870525#comment-15870525 ] Nick Dimiduk commented on SPARK-19614: -- {{lit(null).cast(type)}} does exactly what I needed. Thanks fellas. > add type-preserving null function > - > > Key: SPARK-19614 > URL: https://issues.apache.org/jira/browse/SPARK-19614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nick Dimiduk >Priority: Trivial > > There's currently no easy way to extend the columns of a DataFrame with null > columns that also preserves the type. {{lit(null)}} evaluates to > {{Literal(null, NullType)}}, despite any subsequent hinting, for instance > with {{Column.as(String, Metadata)}}. This comes up when programmatically > munging data from disparate sources. A function such as {{null(DataType)}} > would be nice. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-19614) add type-preserving null function
[ https://issues.apache.org/jira/browse/SPARK-19614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Dimiduk closed SPARK-19614. Resolution: Invalid > add type-preserving null function > - > > Key: SPARK-19614 > URL: https://issues.apache.org/jira/browse/SPARK-19614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nick Dimiduk >Priority: Trivial > > There's currently no easy way to extend the columns of a DataFrame with null > columns that also preserves the type. {{lit(null)}} evaluates to > {{Literal(null, NullType)}}, despite any subsequent hinting, for instance > with {{Column.as(String, Metadata)}}. This comes up when programmatically > munging data from disparate sources. A function such as {{null(DataType)}} > would be nice. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19617) Fix the race condition when starting and stopping a query quickly
[ https://issues.apache.org/jira/browse/SPARK-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19617:
- Description: The streaming thread in StreamExecution uses the following ways to check whether it should exit:
- Catch an InterruptedException.
- `StreamExecution.state` is TERMINATED.

When starting and stopping a query quickly, both checks may fail:
- [HADOOP-14084|https://issues.apache.org/jira/browse/HADOOP-14084] is hit and the InterruptedException is swallowed.
- StreamExecution.stop is called before `state` becomes `ACTIVE`. Then [runBatches|https://github.com/apache/spark/blob/dcc2d540a53f0bd04baead43fdee1c170ef2b9f3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L252] changes the state from `TERMINATED` back to `ACTIVE`.

If both cases happen, the query will hang forever.

was: Saw the following exception in some test log:
{code}
17/02/14 21:20:10.987 stream execution thread for this_query [id = 09fd5d6d-bea3-4891-88c7-0d0f1909188d, runId = a564cb52-bc3d-47f1-8baf-7e0e4fa79a5e] WARN Shell: Interrupted while joining on: Thread[Thread-48,5,main]
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1249)
	at java.lang.Thread.join(Thread.java:1323)
	at org.apache.hadoop.util.Shell.joinThread(Shell.java:626)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:577)
	at org.apache.hadoop.util.Shell.run(Shell.java:479)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:491)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:532)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:509)
	at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1066)
	at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:176)
	at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197)
	at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:730)
	at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:726)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:733)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.mkdirs(HDFSMetadataLog.scala:385)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.<init>(HDFSMetadataLog.scala:75)
	at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.<init>(CompactibleFileStreamLog.scala:46)
	at org.apache.spark.sql.execution.streaming.FileStreamSourceLog.<init>(FileStreamSourceLog.scala:36)
	at org.apache.spark.sql.execution.streaming.FileStreamSource.<init>(FileStreamSource.scala:59)
	at org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:246)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:145)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:141)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257)
	at org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan$lzycompute(StreamExecution.scala:141)
	at org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan(StreamExecution.scala:136)
	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:191)
{code}
This is the cause of some test timeout failures on Jenkins.
> Fix the race condition when starting and stopping a query quickly
> -
>
> Key: SPARK-19617
> URL: https://issues.apache.org/jira/browse/SPARK-19617
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
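The interleaving described in the updated description can be modeled deterministically. The sketch below is hypothetical Python, not Spark's Scala code; the names `Query`, `run_batches_start`, and `should_exit` are illustrative stand-ins for `StreamExecution`, `runBatches`, and the state check.

```python
# Minimal model of the SPARK-19617 race: stop() runs before the streaming
# thread marks the query ACTIVE, so the TERMINATED flag gets overwritten.
# (Hypothetical sketch; not Spark's actual implementation.)

INITIALIZING, ACTIVE, TERMINATED = "INITIALIZING", "ACTIVE", "TERMINATED"

class Query:
    def __init__(self):
        self.state = INITIALIZING

    def stop(self):
        # If the InterruptedException is also swallowed (HADOOP-14084),
        # this flag is the only remaining exit signal.
        self.state = TERMINATED

    def run_batches_start(self):
        # runBatches unconditionally sets ACTIVE, clobbering TERMINATED
        # when stop() won the race.
        self.state = ACTIVE

    def should_exit(self):
        return self.state == TERMINATED

q = Query()
q.stop()                # stop() is called before the streaming thread starts
q.run_batches_start()   # state goes TERMINATED -> ACTIVE: the bug
print(q.should_exit())  # False: neither exit check fires, so the query hangs
```

With this interleaving, `should_exit()` never returns `True` again, matching the "query will hang forever" symptom in the description.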
[jira] [Updated] (SPARK-19617) Fix the race condition when starting and stopping a query quickly
[ https://issues.apache.org/jira/browse/SPARK-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19617: - Summary: Fix the race condition when starting and stopping a query quickly (was: Fix a case that a query may not stop due to HADOOP-14084) > Fix the race condition when starting and stopping a query quickly > - > > Key: SPARK-19617 > URL: https://issues.apache.org/jira/browse/SPARK-19617 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Saw the following exception in some test log: > {code} > 17/02/14 21:20:10.987 stream execution thread for this_query [id = > 09fd5d6d-bea3-4891-88c7-0d0f1909188d, runId = > a564cb52-bc3d-47f1-8baf-7e0e4fa79a5e] WARN Shell: Interrupted while joining > on: Thread[Thread-48,5,main] > java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Thread.join(Thread.java:1249) > at java.lang.Thread.join(Thread.java:1323) > at org.apache.hadoop.util.Shell.joinThread(Shell.java:626) > at org.apache.hadoop.util.Shell.runCommand(Shell.java:577) > at org.apache.hadoop.util.Shell.run(Shell.java:479) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:866) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:849) > at > org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733) > at > org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:491) > at > org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:532) > at > org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:509) > at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1066) > at > org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:176) > at 
org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) > at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:730) > at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:726) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:733) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.mkdirs(HDFSMetadataLog.scala:385) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.(HDFSMetadataLog.scala:75) > at > org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.(CompactibleFileStreamLog.scala:46) > at > org.apache.spark.sql.execution.streaming.FileStreamSourceLog.(FileStreamSourceLog.scala:36) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.(FileStreamSource.scala:59) > at > org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:246) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:145) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:141) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257) > at > org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan$lzycompute(StreamExecution.scala:141) > at > org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan(StreamExecution.scala:136) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252) > at > 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:191) > {code} > This is the cause of some test timeout failures on Jenkins.
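One way to close a race of this shape is to make the INITIALIZING-to-ACTIVE transition conditional under a lock, so a `stop()` that already set TERMINATED can never be overwritten. This is a hypothetical sketch of the pattern, not Spark's actual patch for SPARK-19617; all names are illustrative.

```python
# Guarded state transition: become ACTIVE only if stop() has not already
# terminated the query. (Hypothetical sketch of the fix pattern.)
import threading

INITIALIZING, ACTIVE, TERMINATED = "INITIALIZING", "ACTIVE", "TERMINATED"

class Query:
    def __init__(self):
        self._lock = threading.Lock()
        self.state = INITIALIZING

    def stop(self):
        with self._lock:
            self.state = TERMINATED

    def run_batches_start(self):
        # Check-and-set under the lock instead of an unconditional write.
        with self._lock:
            if self.state == INITIALIZING:
                self.state = ACTIVE
                return True
            return False  # already TERMINATED: exit instead of looping

q = Query()
q.stop()                        # stop() wins the race again...
print(q.run_batches_start())    # False: ...but now the thread exits cleanly
```

Because both writes take the same lock and the transition is conditional, the TERMINATED flag survives regardless of which side runs first.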
[jira] [Commented] (SPARK-19337) Documentation and examples for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-19337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870514#comment-15870514 ] yuhao yang commented on SPARK-19337: Sure, I'll send a PR today. > Documentation and examples for LinearSVC > > > Key: SPARK-19337 > URL: https://issues.apache.org/jira/browse/SPARK-19337 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley > > User guide + example code for LinearSVC
[jira] [Commented] (SPARK-19615) Provide Dataset union convenience for divergent schema
[ https://issues.apache.org/jira/browse/SPARK-19615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870513#comment-15870513 ] Nick Dimiduk commented on SPARK-19615:
-- IMHO, a union operation should be as generous as possible. This facilitates common ETL and data cleansing operations where the sources are sparse-schema structures (JSON, HBase, Elasticsearch, ...). A couple of examples of what I mean. Given dataframes of type
{noformat}
root
 |-- a: string (nullable = false)
 |-- b: string (nullable = true)
{noformat}
and
{noformat}
root
 |-- a: string (nullable = false)
 |-- c: string (nullable = true)
{noformat}
I would expect the union operation to infer the nullable columns from both sides to produce a dataframe of type
{noformat}
root
 |-- a: string (nullable = false)
 |-- b: string (nullable = true)
 |-- c: string (nullable = true)
{noformat}
This should work on an arbitrarily deep nesting of structs, so
{noformat}
root
 |-- a: string (nullable = false)
 |-- b: struct (nullable = false)
 |    |-- b1: string (nullable = true)
 |    |-- b2: string (nullable = true)
{noformat}
unioned with
{noformat}
root
 |-- a: string (nullable = false)
 |-- b: struct (nullable = false)
 |    |-- b3: string (nullable = true)
 |    |-- b4: string (nullable = true)
{noformat}
would result in
{noformat}
root
 |-- a: string (nullable = false)
 |-- b: struct (nullable = false)
 |    |-- b1: string (nullable = true)
 |    |-- b2: string (nullable = true)
 |    |-- b3: string (nullable = true)
 |    |-- b4: string (nullable = true)
{noformat}
> Provide Dataset union convenience for divergent schema
> --
>
> Key: SPARK-19615
> URL: https://issues.apache.org/jira/browse/SPARK-19615
> Project: Spark
> Issue Type: Improvement
> Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>Priority: Minor
>
> Creating a union DataFrame over two sources that have different schema
> definitions is surprisingly complex. Provide a version of the union method
> that will infer a target schema as the result of merging the sources,
> automatically extending either side with {{null}} columns for any missing
> columns that are nullable.
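The proposed behavior — merge the two schemas, then pad each side with null columns — can be sketched outside Spark with plain Python dicts standing in for rows. This is a hypothetical illustration of the request, not a Spark API; `union_by_schema` and `merge_schemas` are made-up names, and nested-struct merging is omitted for brevity.

```python
# Sketch of "union with inferred merged schema": each side is extended
# with None (i.e. null) for columns only the other side defines.
# (Hypothetical; rows are dicts here, not Spark DataFrames.)

def merge_schemas(left_cols, right_cols):
    # Keep left-side column order, then append right-only columns.
    merged = list(left_cols)
    merged += [c for c in right_cols if c not in left_cols]
    return merged

def union_by_schema(left_rows, left_cols, right_rows, right_cols):
    cols = merge_schemas(left_cols, right_cols)
    pad = lambda row: {c: row.get(c) for c in cols}  # missing column -> None
    return cols, [pad(r) for r in left_rows + right_rows]

cols, rows = union_by_schema(
    [{"a": "x", "b": "y"}], ["a", "b"],   # schema: a, b
    [{"a": "z", "c": "w"}], ["a", "c"],   # schema: a, c
)
print(cols)  # ['a', 'b', 'c']
print(rows)  # [{'a': 'x', 'b': 'y', 'c': None}, {'a': 'z', 'b': None, 'c': 'w'}]
```

For the nested-struct case in the comment above, the same merge would be applied recursively to struct fields. (Later Spark versions added `unionByName` with an `allowMissingColumns` flag covering the flat case.)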
[jira] [Updated] (SPARK-19615) Provide Dataset union convenience for divergent schema
[ https://issues.apache.org/jira/browse/SPARK-19615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Dimiduk updated SPARK-19615: - Priority: Major (was: Minor)
> Provide Dataset union convenience for divergent schema
> --
>
> Key: SPARK-19615
> URL: https://issues.apache.org/jira/browse/SPARK-19615
> Project: Spark
> Issue Type: Improvement
> Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>
> Creating a union DataFrame over two sources that have different schema
> definitions is surprisingly complex. Provide a version of the union method
> that will infer a target schema as the result of merging the sources,
> automatically extending either side with {{null}} columns for any missing
> columns that are nullable.