[jira] [Updated] (SPARK-14194) Spark CSV reader not working properly when CSV content contains a CRLF character (newline) in an intermediate cell
[ https://issues.apache.org/jira/browse/SPARK-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-14194:
--
Affects Version/s: 2.1.0

> Spark CSV reader not working properly when CSV content contains a CRLF character (newline) in an intermediate cell
> ---
>
> Key: SPARK-14194
> URL: https://issues.apache.org/jira/browse/SPARK-14194
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.2, 2.1.0
> Reporter: Kumaresh C R
>
> We have CSV content like the following:
> Sl.NO, Employee_Name, Company, Address, Country, ZIP_Code\n\r
> "1", "ABCD", "XYZ", "1234", "XZ Street \n\r(CRLF character), Municipality,", "USA", "1234567"
> Because there is a '\n\r' sequence in the middle of the row (specifically, in the Address column), the Spark code below creates a dataframe with two rows (excluding the header row), which is wrong. Since the field is enclosed in the quote (") character, why is the embedded character treated as a newline (record separator)? This causes problems when processing the resulting dataframe.
> DataFrame df = sqlContextManager.getSqlContext().read().format("com.databricks.spark.csv")
>     .option("header", "true")
>     .option("inferSchema", "true")
>     .option("delimiter", delim)
>     .option("quote", quote)
>     .option("escape", escape)
>     .load(sourceFile);

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
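For reference, RFC 4180 treats a line break inside a quoted field as part of the field value, not as a record separator. A minimal sketch using Python's standard csv module (not Spark's parser) showing the expected single-data-row result on data shaped like the report above:

```python
import csv
import io

# Sample CSV mirroring the report: a quoted Address field containing a CRLF.
data = (
    'Sl.NO,Employee_Name,Company,Address,Country,ZIP_Code\r\n'
    '"1","ABCD","XYZ","XZ Street \r\nMunicipality,","USA","1234567"\r\n'
)

# newline='' is the csv-module convention so line endings reach the parser untranslated.
rows = list(csv.reader(io.StringIO(data, newline='')))

print(len(rows))   # 2 -> header plus exactly one data row
print(rows[1][3])  # the CRLF stays inside the quoted Address cell
```

The point of the sketch is only the contract: a conforming CSV parser must keep consuming input until the closing quote, so the embedded newline never splits the record.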
[jira] [Created] (SPARK-19646) binaryRecords replicates records in scala API
BahaaEddin AlAila created SPARK-19646:
---
Summary: binaryRecords replicates records in Scala API
Key: SPARK-19646
URL: https://issues.apache.org/jira/browse/SPARK-19646
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.1.0, 2.0.0
Reporter: BahaaEddin AlAila
Priority: Minor

The Scala sc.binaryRecords replicates one record across the entire set. For example, I am trying to load the CIFAR binary data, where, in one big binary file, each 3073-byte record represents a 32x32x3-byte image plus 1 byte for the label. The file resides on my local filesystem. .take(5) returns 5 records that are all identical, and .collect() returns 10,000 records that are all identical. What is puzzling is that the PySpark version works perfectly, even though underneath it calls the Scala implementation. I have tested this on 2.1.0 and 2.0.0.
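The expected contract of a fixed-length binary record reader can be sketched in plain Python (this is not the Spark implementation; the function name is illustrative). One plausible cause of "every record is identical" symptoms in readers like this is reusing a single mutable buffer for all records, so a correct reader must hand out an independent copy per record:

```python
import io

RECORD_LEN = 3073  # 32*32*3 image bytes + 1 label byte (CIFAR binary layout)

def binary_records(stream, record_len):
    """Yield fixed-length records, each as its own bytes object.

    Copying (rather than reusing one buffer) is what keeps records
    distinct after the stream has moved on.
    """
    while True:
        chunk = stream.read(record_len)
        if not chunk:
            break
        if len(chunk) != record_len:
            raise ValueError("truncated trailing record")
        yield bytes(chunk)

# Two distinguishable fake records: all-zero bytes vs all-one bytes.
data = bytes([0]) * RECORD_LEN + bytes([1]) * RECORD_LEN
records = list(binary_records(io.BytesIO(data), RECORD_LEN))

print(len(records))               # 2
print(records[0] == records[1])   # False: records stay distinct
```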
[jira] [Updated] (SPARK-19645) structured streaming job restart bug
[ https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guifeng updated SPARK-19645:
Description:

We are trying to use Structured Streaming in production; however, there is currently a bug in the streaming job restart process. The concrete error message is:

{quote}
Caused by: java.lang.IllegalStateException: Error committing version 2 into HDFSStateStore[id = (op=0, part=136), dir = /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to rename /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
... 14 more
{quote}

The bug can easily be reproduced by restarting a previous streaming job. The main reason is that, on restart, Spark recomputes the WAL offsets and generates the same HDFS delta file (the latest delta file generated before the restart, named "currentBatchId.delta"). In my opinion, this is a bug; if you also consider it a bug, I can fix it.
[jira] [Updated] (SPARK-19645) structured streaming job restart bug

[ https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guifeng updated SPARK-19645:
Summary: structured streaming job restart bug (was: structured streaming job restart)

> structured streaming job restart bug
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.1.0
> Reporter: guifeng
> Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there is currently a bug in the streaming job restart process. The concrete error message is:
> {quote}
> Caused by: java.lang.IllegalStateException: Error committing version 2 into HDFSStateStore[id = (op=0, part=136), dir = /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
> at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
> at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
> at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
> at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
> at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
> at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
> ... 14 more
> {quote}
> The bug can easily be reproduced by restarting a previous streaming job. The main reason is that, on restart, Spark recomputes the WAL offsets and generates the same HDFS delta file, named "currentBatchId.delta". In my opinion, this is a bug; if you also consider it a bug, I can fix it.
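The failure above comes down to a non-idempotent file commit: after a restart, the recomputed batch recreates the same "currentBatchId.delta" name, and the rename of the temp file fails because the destination already exists. The idempotent-commit idea can be sketched in plain Python (this is not the HDFSBackedStateStoreProvider code; names and the exists-means-committed policy are illustrative):

```python
import os
import tempfile
from pathlib import Path

def commit_delta(tmp_path, final_path):
    """Commit a temp state file to its final delta name.

    An existing destination is treated as a batch that was already
    committed before the restart, so the temp file is discarded
    instead of the rename failing.
    """
    if os.path.exists(final_path):
        os.remove(tmp_path)   # already committed by the pre-restart run
        return False
    os.rename(tmp_path, final_path)
    return True

with tempfile.TemporaryDirectory() as d:
    final = os.path.join(d, "2.delta")

    tmp1 = os.path.join(d, "temp-1")
    Path(tmp1).write_text("version 2 state")
    print(commit_delta(tmp1, final))   # True: first commit wins

    # A restarted job recomputes the batch and produces the same delta
    # name; the second commit must be a no-op, not an IOException.
    tmp2 = os.path.join(d, "temp-2")
    Path(tmp2).write_text("version 2 state")
    print(commit_delta(tmp2, final))   # False: already committed, no failure
```

Whether an existing delta file should be skipped or overwritten is a design choice for the real fix; the sketch only shows why a bare rename is not restart-safe.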
[jira] [Created] (SPARK-19645) structured streaming job restart
guifeng created SPARK-19645: --- Summary: structured streaming job restart Key: SPARK-19645 URL: https://issues.apache.org/jira/browse/SPARK-19645 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Reporter: guifeng Priority: Critical We are trying to use Structured Streaming in production; however, there is currently a bug in the streaming job restart process. The concrete error message is: {quote} Caused by: java.lang.IllegalStateException: Error committing version 2 into HDFSStateStore[id = (op=0, part=136), dir = /tmp/PipelinePolicy_112346-continueagg-bxaxs/state/0/136] at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162) at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173) at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123) at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Failed to rename /tmp/PipelinePolicy_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 to
/tmp/PipelinePolicy_112346-continueagg-bxaxs/state/0/136/2.delta at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156) ... 14 more {quote} The bug is easy to reproduce by restarting a previous streaming job: on restart, Spark recomputes the WAL offsets and generates an HDFS delta file with the same name, "currentBatchId.delta", so the rename of the temp file fails. In my opinion this is a bug; if you also consider it a bug, I can fix it. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
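The commit protocol in the stack trace above can be modeled without Spark: the state store provider writes updates to a temp file and then renames it to "<batchId>.delta", and a rename that refuses to overwrite an existing target fails when a restarted job recomputes the same batch id. Below is a minimal Python sketch of that failure mode; the file names and the explicit existence check are illustrative stand-ins for the HDFS rename semantics, not Spark's actual code.

```python
import os
import tempfile

def commit(state_dir, batch_id, temp_name, data):
    """Write updates to a temp file, then rename it to <batch_id>.delta.

    The explicit existence check mirrors HDFS rename semantics, which
    refuse to overwrite an existing file, so committing the same batch
    id twice fails just like in the stack trace above.
    """
    temp_path = os.path.join(state_dir, temp_name)
    target = os.path.join(state_dir, "%d.delta" % batch_id)
    with open(temp_path, "w") as f:
        f.write(data)
    if os.path.exists(target):
        os.remove(temp_path)  # commit fails; clean up the temp file
        return False
    os.rename(temp_path, target)
    return True

state_dir = tempfile.mkdtemp()
first = commit(state_dir, 2, "temp-1", "updates")   # original run: succeeds
second = commit(state_dir, 2, "temp-2", "updates")  # restart recomputes batch 2: fails
```

The second commit fails exactly as reported: the data is the same, only the target name collides, which is why the delta file generated on restart cannot be committed.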
[jira] [Updated] (SPARK-19644) Memory leak in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deenbandhu Agarwal updated SPARK-19644: --- Affects Version/s: (was: 2.0.1) 2.0.2 Environment: 3 AWS EC2 c3.xLarge Number of cores - 3 Number of executors 3 Memory to each executor 2GB was: 3 AWS EC2 c3.xLarge Number of cores - 3 Number of executers 3 Memory to each executor 2GB Labels: memory_leak performance (was: performance) Component/s: (was: Structured Streaming) DStreams > Memory leak in Spark Streaming > -- > > Key: SPARK-19644 > URL: https://issues.apache.org/jira/browse/SPARK-19644 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 > Environment: 3 AWS EC2 c3.xLarge > Number of cores - 3 > Number of executors 3 > Memory to each executor 2GB >Reporter: Deenbandhu Agarwal >Priority: Critical > Labels: memory_leak, performance > Attachments: heapdump.png > > > I am using streaming in production for some aggregation, fetching data > from Cassandra and saving data back to Cassandra. > I see a gradual increase in old generation heap capacity from 1161216 Bytes > to 1397760 Bytes over a period of six hours. > After 50 hours of processing, instances of class > scala.collection.immutable.$colon$colon increased to 12,811,793, which is a > huge number. > I think this is a clear case of a memory leak -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19644) Memory leak in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deenbandhu Agarwal updated SPARK-19644: --- Attachment: heapdump.png Snapshot of heap dump after 50 hours > Memory leak in Spark Streaming > -- > > Key: SPARK-19644 > URL: https://issues.apache.org/jira/browse/SPARK-19644 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.1 > Environment: 3 AWS EC2 c3.xLarge > Number of cores - 3 > Number of executors 3 > Memory to each executor 2GB >Reporter: Deenbandhu Agarwal >Priority: Critical > Labels: performance > Attachments: heapdump.png > > > I am using streaming in production for some aggregation, fetching data > from Cassandra and saving data back to Cassandra. > I see a gradual increase in old generation heap capacity from 1161216 Bytes > to 1397760 Bytes over a period of six hours. > After 50 hours of processing, instances of class > scala.collection.immutable.$colon$colon increased to 12,811,793, which is a > huge number. > I think this is a clear case of a memory leak -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19644) Memory leak in Spark Streaming
Deenbandhu Agarwal created SPARK-19644: -- Summary: Memory leak in Spark Streaming Key: SPARK-19644 URL: https://issues.apache.org/jira/browse/SPARK-19644 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.0.1 Environment: 3 AWS EC2 c3.xLarge Number of cores - 3 Number of executors 3 Memory to each executor 2GB Reporter: Deenbandhu Agarwal Priority: Critical I am using streaming in production for some aggregation, fetching data from Cassandra and saving data back to Cassandra. I see a gradual increase in old generation heap capacity from 1161216 Bytes to 1397760 Bytes over a period of six hours. After 50 hours of processing, instances of class scala.collection.immutable.$colon$colon increased to 12,811,793, which is a huge number. I think this is a clear case of a memory leak -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871202#comment-15871202 ] Takeshi Yamamuro commented on SPARK-19641: -- okay, thanks! > JSON schema inference in DROPMALFORMED mode produces incorrect schema > - > > Key: SPARK-19641 > URL: https://issues.apache.org/jira/browse/SPARK-19641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nathan Howell > > In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no > columns. This occurs when one document contains a valid JSON value (such as a > string or number) and the other documents contain objects or arrays. > When the default case in {{JsonInferSchema.compatibleRootType}} is reached > when merging a {{StringType}} and a {{StructType}} the resulting type will be > a {{StringType}}, which is then discarded because a {{StructType}} is > expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871196#comment-15871196 ] Hyukjin Kwon commented on SPARK-19641: -- Ah, thanks for cc'ing me. I happened to see the related comment. I will leave them here just in case anyone is curious. https://github.com/apache/spark/pull/16386#issuecomment-280525182 https://github.com/NathanHowell/spark/commit/e233fd03346a73b3b447fa4c24f3b12c8b2e53ae > JSON schema inference in DROPMALFORMED mode produces incorrect schema > - > > Key: SPARK-19641 > URL: https://issues.apache.org/jira/browse/SPARK-19641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nathan Howell > > In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no > columns. This occurs when one document contains a valid JSON value (such as a > string or number) and the other documents contain objects or arrays. > When the default case in {{JsonInferSchema.compatibleRootType}} is reached > when merging a {{StringType}} and a {{StructType}} the resulting type will be > a {{StringType}}, which is then discarded because a {{StructType}} is > expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19557) Output parameters are not present in SQL Query Plan
[ https://issues.apache.org/jira/browse/SPARK-19557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19557: --- Assignee: Wenchen Fan > Output parameters are not present in SQL Query Plan > --- > > Key: SPARK-19557 > URL: https://issues.apache.org/jira/browse/SPARK-19557 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Salil Surendran >Assignee: Wenchen Fan > Fix For: 2.2.0 > > > For DataFrameWriter methods like parquet(), json(), csv(), etc., output > parameters are not present in the QueryExecution object. For methods like > saveAsTable() they are. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18120) QueryExecutionListener method doesn't get executed for DataFrameWriter methods
[ https://issues.apache.org/jira/browse/SPARK-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-18120: --- Assignee: Wenchen Fan > QueryExecutionListener method doesn't get executed for DataFrameWriter methods > -- > > Key: SPARK-18120 > URL: https://issues.apache.org/jira/browse/SPARK-18120 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Salil Surendran >Assignee: Wenchen Fan > Fix For: 2.2.0 > > > QueryExecutionListener is a class that has methods named onSuccess() and > onFailure() that get called when a query is executed. Each of those methods > takes a QueryExecution object as a parameter, which can be used for metrics > analysis. It gets called for several of the Dataset methods like take, head, > first, collect, etc., but doesn't get called for any of the DataFrameWriter > methods like saveAsTable, save, etc. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19557) Output parameters are not present in SQL Query Plan
[ https://issues.apache.org/jira/browse/SPARK-19557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19557. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16962 [https://github.com/apache/spark/pull/16962] > Output parameters are not present in SQL Query Plan > --- > > Key: SPARK-19557 > URL: https://issues.apache.org/jira/browse/SPARK-19557 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Salil Surendran > Fix For: 2.2.0 > > > For DataFrameWriter methods like parquet(), json(), csv(), etc., output > parameters are not present in the QueryExecution object. For methods like > saveAsTable() they are. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18120) QueryExecutionListener method doesn't get executed for DataFrameWriter methods
[ https://issues.apache.org/jira/browse/SPARK-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18120. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16962 [https://github.com/apache/spark/pull/16962] > QueryExecutionListener method doesn't get executed for DataFrameWriter methods > -- > > Key: SPARK-18120 > URL: https://issues.apache.org/jira/browse/SPARK-18120 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Salil Surendran > Fix For: 2.2.0 > > > QueryExecutionListener is a class that has methods named onSuccess() and > onFailure() that get called when a query is executed. Each of those methods > takes a QueryExecution object as a parameter, which can be used for metrics > analysis. It gets called for several of the Dataset methods like take, head, > first, collect, etc., but doesn't get called for any of the DataFrameWriter > methods like saveAsTable, save, etc. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-18352: --- Assignee: Nathan Howell > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Nathan Howell > Labels: releasenotes > Fix For: 2.2.0 > > > Spark currently can only parse JSON files that are JSON lines, i.e., each > record occupies exactly one line and records are separated by newlines. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and instead stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18352. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16386 [https://github.com/apache/spark/pull/16386] > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Labels: releasenotes > Fix For: 2.2.0 > > > Spark currently can only parse JSON files that are JSON lines, i.e., each > record occupies exactly one line and records are separated by newlines. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and instead stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19638) Filter pushdown not working for struct fields
[ https://issues.apache.org/jira/browse/SPARK-19638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871162#comment-15871162 ] Takeshi Yamamuro commented on SPARK-19638: -- Filter pushdown depends on the data source, so this behavior is specific to the ES connector. There is no "Pushing down filters..." log entry in Spark itself, so you should ask the connector's author community. > Filter pushdown not working for struct fields > - > > Key: SPARK-19638 > URL: https://issues.apache.org/jira/browse/SPARK-19638 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nick Dimiduk > > Working with a dataset containing struct fields, and enabling debug logging > in the ES connector, I'm seeing the following behavior. The dataframe is > created over the ES connector and then the schema is extended with a couple > column aliases, such as: > {noformat} > df.withColumn("f2", df("foo")) > {noformat} > Queries against those alias columns work as expected for fields that are > non-struct members. > {noformat} > scala> df.withColumn("f2", df("foo")).where("f2 == '1'").limit(0).show > 17/02/16 15:06:49 DEBUG DataSource: Pushing down filters > [IsNotNull(foo),EqualTo(foo,1)] > 17/02/16 15:06:49 TRACE DataSource: Transformed filters into DSL > [{"exists":{"field":"foo"}},{"match":{"foo":"1"}}] > {noformat} > However, try the same with an alias over a struct field, and no filters are > pushed down. > {noformat} > scala> df.withColumn("bar_baz", df("bar.baz")).where("bar_baz == > '1'").limit(1).show > {noformat} > In fact, this is the case even when no alias is used at all. > {noformat} > scala> df.where("bar.baz == '1'").limit(1).show > {noformat} > Basically, pushdown for structs doesn't work at all. > Maybe this is specific to the ES connector? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
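For background on why nested fields behave differently across sources: the filter API that data source connectors receive identifies columns by flat name only, so a predicate over a nested field like bar.baz has no representation there and is never offered to the source. The toy Python model below sketches that translation step; the class and tuple names are illustrative stand-ins, not Spark internals.

```python
class Attr:
    """A top-level column reference, e.g. df("foo")."""
    def __init__(self, name):
        self.name = name

class GetStructField:
    """A nested field access, e.g. df("bar.baz")."""
    def __init__(self, child, field):
        self.child, self.field = child, field

def translate_equal_to(column_expr, value):
    """Translate an EqualTo predicate into a source-level filter tuple.

    The source filter API names columns by flat string only, so a nested
    access has no representation: translation returns None and the
    predicate stays behind as a post-scan filter.
    """
    if isinstance(column_expr, Attr):
        return ("EqualTo", column_expr.name, value)
    return None  # nested field: cannot be pushed down

flat = translate_equal_to(Attr("foo"), "1")                           # pushed down
nested = translate_equal_to(GetStructField(Attr("bar"), "baz"), "1")  # not pushed
```

This matches the reported trace: the flat predicate on foo is handed to the connector, while the predicate on bar.baz never reaches it.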
[jira] [Created] (SPARK-19643) Document how to use Spark/SparkR on Windows
Felix Cheung created SPARK-19643: Summary: Document how to use Spark/SparkR on Windows Key: SPARK-19643 URL: https://issues.apache.org/jira/browse/SPARK-19643 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.0 Reporter: Felix Cheung winutils and/or HADOOP_HOME is required to run Spark on Windows. Since we have a SparkR package that can be run on Windows machines, we should document this in the programming guide and, more importantly, in the vignettes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871152#comment-15871152 ] Takeshi Yamamuro commented on SPARK-19641: -- Could you show us a simple query to reproduce this? cc: [~hyukjin.kwon] > JSON schema inference in DROPMALFORMED mode produces incorrect schema > - > > Key: SPARK-19641 > URL: https://issues.apache.org/jira/browse/SPARK-19641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nathan Howell > > In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no > columns. This occurs when one document contains a valid JSON value (such as a > string or number) and the other documents contain objects or arrays. > When the default case in {{JsonInferSchema.compatibleRootType}} is reached > when merging a {{StringType}} and a {{StructType}} the resulting type will be > a {{StringType}}, which is then discarded because a {{StructType}} is > expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
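While a full reproduction needs a JSON dataset with mixed root types, the mechanism in the issue description can be shown with a toy model. The Python stand-ins below are not Spark's JsonInferSchema; they just demonstrate the described behavior: merging a struct root with a string root through the default case degrades to StringType, which the caller then discards because it expects a struct, leaving zero columns.

```python
STRING = ("StringType",)

def struct(fields):
    """A struct type here is just a tag plus its set of field names."""
    return ("StructType", frozenset(fields))

def compatible_root_type(t1, t2):
    """Simplified stand-in for compatibleRootType: structs merge
    field-wise, while any other combination falls through the default
    case and degrades to StringType."""
    if t1[0] == "StructType" and t2[0] == "StructType":
        return struct(set(t1[1]) | set(t2[1]))
    return STRING  # default case: StringType vs StructType ends up here

docs = [struct({"a"}), STRING]  # one object document, one bare string value
merged = compatible_root_type(*docs)

# The caller keeps only struct roots, so a StringType result means the
# inferred schema ends up with no columns at all.
inferred_fields = merged[1] if merged[0] == "StructType" else frozenset()
```

In DROPMALFORMED mode the object document is then judged against this empty schema, which is how the incorrect column-less result arises.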
[jira] [Assigned] (SPARK-19556) Broadcast data is not encrypted when I/O encryption is on
[ https://issues.apache.org/jira/browse/SPARK-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19556: Assignee: (was: Apache Spark) > Broadcast data is not encrypted when I/O encryption is on > - > > Key: SPARK-19556 > URL: https://issues.apache.org/jira/browse/SPARK-19556 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin > > {{TorrentBroadcast}} uses a couple of "back doors" into the block manager to > write and read data: > {code} > if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, > tellMaster = true)) { > throw new SparkException(s"Failed to store $pieceId of $broadcastId > in local BlockManager") > } > {code} > {code} > bm.getLocalBytes(pieceId) match { > case Some(block) => > blocks(pid) = block > releaseLock(pieceId) > case None => > bm.getRemoteBytes(pieceId) match { > case Some(b) => > if (checksumEnabled) { > val sum = calcChecksum(b.chunks(0)) > if (sum != checksums(pid)) { > throw new SparkException(s"corrupt remote block $pieceId of > $broadcastId:" + > s" $sum != ${checksums(pid)}") > } > } > // We found the block from remote executors/driver's > BlockManager, so put the block > // in this executor's BlockManager. > if (!bm.putBytes(pieceId, b, StorageLevel.MEMORY_AND_DISK_SER, > tellMaster = true)) { > throw new SparkException( > s"Failed to store $pieceId of $broadcastId in local > BlockManager") > } > blocks(pid) = b > case None => > throw new SparkException(s"Failed to get $pieceId of > $broadcastId") > } > } > {code} > The thing these block manager methods have in common is that they bypass the > encryption code; so broadcast data is stored unencrypted in the block > manager, causing unencrypted data to be written to disk if those blocks need > to be evicted from memory. 
> The correct fix here is actually not to change {{TorrentBroadcast}}, but to > fix the block manager so that: > - data stored in memory is not encrypted > - data written to disk is encrypted > This would simplify the code paths that use BlockManager / SerializerManager > APIs (e.g. see SPARK-19520), but requires some tricky changes inside the > BlockManager to still be able to use file channels to avoid reading whole > blocks back into memory so they can be decrypted. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19556) Broadcast data is not encrypted when I/O encryption is on
[ https://issues.apache.org/jira/browse/SPARK-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19556: Assignee: Apache Spark > Broadcast data is not encrypted when I/O encryption is on > - > > Key: SPARK-19556 > URL: https://issues.apache.org/jira/browse/SPARK-19556 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > {{TorrentBroadcast}} uses a couple of "back doors" into the block manager to > write and read data: > {code} > if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, > tellMaster = true)) { > throw new SparkException(s"Failed to store $pieceId of $broadcastId > in local BlockManager") > } > {code} > {code} > bm.getLocalBytes(pieceId) match { > case Some(block) => > blocks(pid) = block > releaseLock(pieceId) > case None => > bm.getRemoteBytes(pieceId) match { > case Some(b) => > if (checksumEnabled) { > val sum = calcChecksum(b.chunks(0)) > if (sum != checksums(pid)) { > throw new SparkException(s"corrupt remote block $pieceId of > $broadcastId:" + > s" $sum != ${checksums(pid)}") > } > } > // We found the block from remote executors/driver's > BlockManager, so put the block > // in this executor's BlockManager. > if (!bm.putBytes(pieceId, b, StorageLevel.MEMORY_AND_DISK_SER, > tellMaster = true)) { > throw new SparkException( > s"Failed to store $pieceId of $broadcastId in local > BlockManager") > } > blocks(pid) = b > case None => > throw new SparkException(s"Failed to get $pieceId of > $broadcastId") > } > } > {code} > The thing these block manager methods have in common is that they bypass the > encryption code; so broadcast data is stored unencrypted in the block > manager, causing unencrypted data to be written to disk if those blocks need > to be evicted from memory. 
> The correct fix here is actually not to change {{TorrentBroadcast}}, but to > fix the block manager so that: > - data stored in memory is not encrypted > - data written to disk is encrypted > This would simplify the code paths that use BlockManager / SerializerManager > APIs (e.g. see SPARK-19520), but requires some tricky changes inside the > BlockManager to still be able to use file channels to avoid reading whole > blocks back into memory so they can be decrypted. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19573) Make NaN/null handling consistent in approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-19573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871114#comment-15871114 ] Apache Spark commented on SPARK-19573: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/16971 > Make NaN/null handling consistent in approxQuantile > --- > > Key: SPARK-19573 > URL: https://issues.apache.org/jira/browse/SPARK-19573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: zhengruifeng > > As discussed in https://github.com/apache/spark/pull/16776, this JIRA is used > to track the following issue: > The multi-column version of approxQuantile drops the rows containing *any* > NaN/null, so the results are not consistent with the outputs of the > single-column version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
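The inconsistency being tracked can be stated concretely: dropping rows that contain *any* NaN/null before aggregating changes the data each individual column sees, compared with the single-column path, which only drops that column's own missing values. A small Python illustration with toy data (not Spark's approxQuantile algorithm):

```python
NAN = float("nan")

def is_missing(x):
    # NaN is the only value unequal to itself; None models SQL null
    return x is None or x != x

rows = [(1.0, 10.0), (2.0, NAN), (3.0, 30.0)]

# Multi-column behavior described above: drop rows with *any* missing value.
kept_rows = [r for r in rows if not any(is_missing(v) for v in r)]
multi_col_first = [r[0] for r in kept_rows]

# Single-column behavior: drop missing values per column, independently.
single_col_first = [r[0] for r in rows if not is_missing(r[0])]

# The first column sees different data in the two code paths, so any
# quantile computed from it can differ as well.
```

The first column loses the valid value 2.0 in the multi-column path only because its neighbor is NaN, which is the discrepancy the pull request addresses.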
[jira] [Assigned] (SPARK-19573) Make NaN/null handling consistent in approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-19573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19573: Assignee: (was: Apache Spark) > Make NaN/null handling consistent in approxQuantile > --- > > Key: SPARK-19573 > URL: https://issues.apache.org/jira/browse/SPARK-19573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: zhengruifeng > > As discussed in https://github.com/apache/spark/pull/16776, this JIRA is used > to track the following issue: > The multi-column version of approxQuantile drops the rows containing *any* > NaN/null, so the results are not consistent with the outputs of the > single-column version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19573) Make NaN/null handling consistent in approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-19573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19573: Assignee: Apache Spark > Make NaN/null handling consistent in approxQuantile > --- > > Key: SPARK-19573 > URL: https://issues.apache.org/jira/browse/SPARK-19573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: zhengruifeng >Assignee: Apache Spark > > As discussed in https://github.com/apache/spark/pull/16776, this JIRA is used > to track the following issue: > The multi-column version of approxQuantile drops the rows containing *any* > NaN/null, so the results are not consistent with the outputs of the > single-column version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19623) Take rows from DataFrame with empty first partition
[ https://issues.apache.org/jira/browse/SPARK-19623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870990#comment-15870990 ] Jaeboo Jung edited comment on SPARK-19623 at 2/17/17 2:15 AM: -- Increasing driver memory can't clear this issue because memory consumption grows proportionally to the number of partitions. For example, 5g of driver memory processes 1000 partitions properly, but an OOME occurs with 3000 partitions. {code} import org.apache.spark.sql._ import org.apache.spark.sql.types._ val rdd = sc.parallelize(1 to 1,3000).map(i => Row.fromSeq(Array.fill(100)(i))) val schema = StructType(for(i <- 1 to 100) yield { StructField("COL"+i,IntegerType, true) }) val rdd2 = rdd.mapPartitionsWithIndex((idx,iter) => if(idx==0 || idx==1) Iterator[Row]() else iter) val df2 = sqlContext.createDataFrame(rdd2,schema) df2.rdd.take(1000) // OK df2.take(1000) // OOME {code} I think this issue comes from a difference in the row-taking process between the RDD and the DataFrame: the RDD takes rows with its internal method, but the DataFrame takes rows through the Limit SQL process. The weird part is that the DataFrame seeks through all the partitions when the first partition is empty. Maybe it is related to SPARK-3211. was (Author: jb jung): Increasing driver memory can't clear this issue because memory consumption grows proportionally to the number of partitions. For example, 5g of driver memory processes 1000 partitions properly but OOME occurs in case of 3000 partitions. 
{code} import org.apache.spark.sql._ import org.apache.spark.sql.types._ val rdd = sc.parallelize(1 to 1,3000).map(i => Row.fromSeq(Array.fill(100)(i))) val schema = StructType(for(i <- 1 to 100) yield { StructField("COL"+i,IntegerType, true) }) val rdd2 = rdd.mapPartitionsWithIndex((idx,iter) => if(idx==0 || idx==1) Iterator[Row]() else iter) val df2 = sqlContext.createDataFrame(rdd2,schema) df2.rdd.take(1000) // OK df2.take(1000) // OOME {code} I think this issue comes from differences of taking rows process between rdd and dataframe. RDD takes rows with its internal method but DataFrame takes rows with Limit SQL process. The weird part is dataframe scanning all the partitions when the first partition is empty. Maybe it is related to SPARK-3211. > Take rows from DataFrame with empty first partition > --- > > Key: SPARK-19623 > URL: https://issues.apache.org/jira/browse/SPARK-19623 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2 >Reporter: Jaeboo Jung >Priority: Minor > > I use Spark 1.6.2 with 1 master and 6 workers. Assuming we have a DataFrame > whose first partition is empty, the DataFrame and its RDD behave differently > when taking rows from it. If we take only 1000 rows from the DataFrame, it > causes an OOME, but the RDD is OK. > In detail, > DataFrame without an empty first partition => OK > DataFrame with an empty first partition => OOME > RDD of DataFrame with an empty first partition => OK > The code below reproduces this error. 
> {code} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val rdd = sc.parallelize(1 to 1,1000).map(i => > Row.fromSeq(Array.fill(100)(i))) > val schema = StructType(for(i <- 1 to 100) yield { > StructField("COL"+i,IntegerType, true) > }) > val rdd2 = rdd.mapPartitionsWithIndex((idx,iter) => if(idx==0 || idx==1) > Iterator[Row]() else iter) > val df1 = sqlContext.createDataFrame(rdd,schema) > df1.take(1000) // OK > val df2 = sqlContext.createDataFrame(rdd2,schema) > df2.rdd.take(1000) // OK > df2.take(1000) // OOME > {code} > I tested it on Spark 1.6.2 with 2gb of driver memory and 5gb of executor > memory. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
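For contrast with the DataFrame Limit path, the incremental scan that RDD.take performs can be modeled in a few lines of plain Scala (a simplified sketch, not Spark's actual scheduler code; all names here are illustrative):

```scala
// Simplified model of why RDD.take tolerates empty leading partitions:
// it scans partitions one at a time and stops once it has n rows,
// instead of planning a global Limit over the whole dataset.
object TakeModel {
  def incrementalTake[T](partitions: Seq[Seq[T]], n: Int): List[T] = {
    val out = scala.collection.mutable.ListBuffer.empty[T]
    val it = partitions.iterator
    // Pull from the next partition only while we still need rows
    while (out.size < n && it.hasNext) out ++= it.next().take(n - out.size)
    out.toList
  }

  def main(args: Array[String]): Unit = {
    // 3000 partitions, the first two empty, mirroring the reproduction above
    val parts = Seq(Seq.empty[Int], Seq.empty[Int]) ++ (1 to 2998).map(i => Seq(i))
    println(incrementalTake(parts, 5)) // prints List(1, 2, 3, 4, 5)
  }
}
```

Under this model only the first handful of partitions are ever touched, which is consistent with df2.rdd.take(1000) succeeding while the SQL Limit path exhausts driver memory.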
[jira] [Updated] (SPARK-19642) Improve the security guarantee for rest api and ui
[ https://issues.apache.org/jira/browse/SPARK-19642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Genmao Yu updated SPARK-19642: -- Summary: Improve the security guarantee for rest api and ui (was: Improve the security guarantee for rest api) > Improve the security guarantee for rest api and ui > -- > > Key: SPARK-19642 > URL: https://issues.apache.org/jira/browse/SPARK-19642 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Priority: Critical > > As Spark gets more and more features, data may start leaking through other > places (e.g. SQL query plans which are shown in the UI). Also current rest > api may be a security hole. Open this JIRA to research and address the > potential security flaws. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19642) Improve the security guarantee for rest api
[ https://issues.apache.org/jira/browse/SPARK-19642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Genmao Yu updated SPARK-19642: -- Description: As Spark gets more and more features, data may start leaking through other places (e.g. SQL query plans which are shown in the UI). Also current rest api may be a security hole. Open this JIRA to research and address the potential security flaws. (was: As Spark gets more and more features, data may start leaking through other places (e.g. SQL query plans which are shown in the UI). Also current rest api may be a security hole. Opening this JIRA to research and address the potential security flaws.) > Improve the security guarantee for rest api > --- > > Key: SPARK-19642 > URL: https://issues.apache.org/jira/browse/SPARK-19642 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Priority: Critical > > As Spark gets more and more features, data may start leaking through other > places (e.g. SQL query plans which are shown in the UI). Also current rest > api may be a security hole. Open this JIRA to research and address the > potential security flaws. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19642) Improve the security guarantee for rest api
[ https://issues.apache.org/jira/browse/SPARK-19642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Genmao Yu updated SPARK-19642: -- Description: As Spark gets more and more features, data may start leaking through other places (e.g. SQL query plans which are shown in the UI). Also current rest api may be a security hole. Opening this JIRA to research and address the potential security flaws. (was: As Spark gets more and more features, data may start leaking through other places (e.g. SQL query plans which are shown in the UI). And current rest api may be a security hole. Opening this JIRA to research and address the potential security flaws.) > Improve the security guarantee for rest api > --- > > Key: SPARK-19642 > URL: https://issues.apache.org/jira/browse/SPARK-19642 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Priority: Critical > > As Spark gets more and more features, data may start leaking through other > places (e.g. SQL query plans which are shown in the UI). Also current rest > api may be a security hole. Opening this JIRA to research and address the > potential security flaws. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19642) Improve the security guarantee for rest api
[ https://issues.apache.org/jira/browse/SPARK-19642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871058#comment-15871058 ] Genmao Yu commented on SPARK-19642: --- cc [~ajbozarth], [~vanzin] and [~srowen] > Improve the security guarantee for rest api > --- > > Key: SPARK-19642 > URL: https://issues.apache.org/jira/browse/SPARK-19642 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Priority: Critical > > As Spark gets more and more features, data may start leaking through other > places (e.g. SQL query plans which are shown in the UI). And current rest api > may be a security hole. Opening this JIRA to research and address the > potential security flaws. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19642) Improve the security guarantee for rest api
Genmao Yu created SPARK-19642: - Summary: Improve the security guarantee for rest api Key: SPARK-19642 URL: https://issues.apache.org/jira/browse/SPARK-19642 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.0, 2.0.2 Reporter: Genmao Yu Priority: Critical As Spark gets more and more features, data may start leaking through other places (e.g. SQL query plans which are shown in the UI). And current rest api may be a security hole. Opening this JIRA to research and address the potential security flaws. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19625) Authorization Support(on all operations not only DDL) in Spark Sql version 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-19625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871035#comment-15871035 ] 翟玉勇 commented on SPARK-19625: - For Spark 1.5.2 there are https://github.com/apache/spark/pull/10144 and https://github.com/apache/spark/pull/11045, but they do not work with 2.1.0. Could you share your code? zhaiyuy...@126.com, thanks. > Authorization Support(on all operations not only DDL) in Spark Sql version > 2.1.0 > > > Key: SPARK-19625 > URL: https://issues.apache.org/jira/browse/SPARK-19625 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: 翟玉勇 > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema
Nathan Howell created SPARK-19641: - Summary: JSON schema inference in DROPMALFORMED mode produces incorrect schema Key: SPARK-19641 URL: https://issues.apache.org/jira/browse/SPARK-19641 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Nathan Howell In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no columns. This occurs when one document contains a valid JSON value (such as a string or number) and the other documents contain objects or arrays. When the default case in {{JsonInferSchema.compatibleRootType}} is reached while merging a {{StringType}} and a {{StructType}}, the resulting type is a {{StringType}}, which is then discarded because a {{StructType}} is expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
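The merge behavior described above can be sketched with a tiny stand-in for the type lattice (a hypothetical model for illustration, not the actual JsonInferSchema code):

```scala
// Minimal model of root-type merging during JSON schema inference:
// two structs merge field-wise, but any other combination falls through
// to the default case and widens to StringType.
object RootTypeModel {
  sealed trait DataType
  case object StringType extends DataType
  case class StructType(fields: Set[String]) extends DataType

  def compatibleRootType(a: DataType, b: DataType): DataType = (a, b) match {
    case (StructType(f1), StructType(f2)) => StructType(f1 ++ f2)
    case _                                => StringType // default: widen to string
  }

  def main(args: Array[String]): Unit = {
    val merged = compatibleRootType(StructType(Set("a")), StringType)
    // A caller that expects a StructType root then discards this result,
    // which is how the inferred schema ends up with no columns.
    println(merged) // prints StringType
  }
}
```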
[jira] [Commented] (SPARK-19497) dropDuplicates with watermark
[ https://issues.apache.org/jira/browse/SPARK-19497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871016#comment-15871016 ] Shixiong Zhu commented on SPARK-19497: -- [~samelamin] Thanks! I just submitted a PR. Could you help review it, please? > dropDuplicates with watermark > - > > Key: SPARK-19497 > URL: https://issues.apache.org/jira/browse/SPARK-19497 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Michael Armbrust >Assignee: Shixiong Zhu >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19640) Incorrect documentation for MLlib CountVectorizerModel for spark 1.5.2
[ https://issues.apache.org/jira/browse/SPARK-19640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871015#comment-15871015 ] yuhao yang commented on SPARK-19640: Thanks for reporting the issue. Feel free to send a fix if you would like to. > Incorrect documentation for MLlib CountVectorizerModel for spark 1.5.2 > -- > > Key: SPARK-19640 > URL: https://issues.apache.org/jira/browse/SPARK-19640 > Project: Spark > Issue Type: Task > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Stephen Kinser >Priority: Trivial > Labels: documentation, easyfix > Fix For: 1.5.2 > > Original Estimate: 0h > Remaining Estimate: 0h > > Spark MLLib documentation for CountVectorizerModel in spark 1.5.2 currently > uses import statement of package path that does not exist > import org.apache.spark.ml.feature.CountVectorizer > import org.apache.spark.mllib.util.CountVectorizerModel > this should be revised to what it is like in spark 1.6+ > import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19497) dropDuplicates with watermark
[ https://issues.apache.org/jira/browse/SPARK-19497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19497: Assignee: Shixiong Zhu (was: Apache Spark) > dropDuplicates with watermark > - > > Key: SPARK-19497 > URL: https://issues.apache.org/jira/browse/SPARK-19497 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Michael Armbrust >Assignee: Shixiong Zhu >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19497) dropDuplicates with watermark
[ https://issues.apache.org/jira/browse/SPARK-19497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871010#comment-15871010 ] Apache Spark commented on SPARK-19497: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16970 > dropDuplicates with watermark > - > > Key: SPARK-19497 > URL: https://issues.apache.org/jira/browse/SPARK-19497 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Michael Armbrust >Assignee: Shixiong Zhu >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19497) dropDuplicates with watermark
[ https://issues.apache.org/jira/browse/SPARK-19497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19497: Assignee: Apache Spark (was: Shixiong Zhu) > dropDuplicates with watermark > - > > Key: SPARK-19497 > URL: https://issues.apache.org/jira/browse/SPARK-19497 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Michael Armbrust >Assignee: Apache Spark >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19640) Incorrect documentation for MLlib CountVectorizerModel for spark 1.5.2
[ https://issues.apache.org/jira/browse/SPARK-19640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871000#comment-15871000 ] Stephen Kinser commented on SPARK-19640: [~yuhaoyan] I saw that you were the one who originally wrote the documentation from SPARK-9890 , should you make the change or should I? > Incorrect documentation for MLlib CountVectorizerModel for spark 1.5.2 > -- > > Key: SPARK-19640 > URL: https://issues.apache.org/jira/browse/SPARK-19640 > Project: Spark > Issue Type: Task > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Stephen Kinser >Priority: Trivial > Labels: documentation, easyfix > Fix For: 1.5.2 > > Original Estimate: 0h > Remaining Estimate: 0h > > Spark MLLib documentation for CountVectorizerModel in spark 1.5.2 currently > uses import statement of package path that does not exist > import org.apache.spark.ml.feature.CountVectorizer > import org.apache.spark.mllib.util.CountVectorizerModel > this should be revised to what it is like in spark 1.6+ > import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19640) Incorrect documentation for MLlib CountVectorizerModel for spark 1.5.2
Stephen Kinser created SPARK-19640: -- Summary: Incorrect documentation for MLlib CountVectorizerModel for spark 1.5.2 Key: SPARK-19640 URL: https://issues.apache.org/jira/browse/SPARK-19640 Project: Spark Issue Type: Task Components: MLlib Affects Versions: 1.5.2 Reporter: Stephen Kinser Priority: Trivial Fix For: 1.5.2 The Spark MLlib documentation for CountVectorizerModel in Spark 1.5.2 currently uses an import statement with a package path that does not exist:
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.mllib.util.CountVectorizerModel
This should be revised to match Spark 1.6+:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19623) Take rows from DataFrame with empty first partition
[ https://issues.apache.org/jira/browse/SPARK-19623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870990#comment-15870990 ] Jaeboo Jung commented on SPARK-19623: - Increasing driver memory does not resolve this issue, because memory consumption grows in proportion to the number of partitions. For example, 5g of driver memory handles 1000 partitions properly, but an OOME occurs with 3000 partitions.
{code}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val rdd = sc.parallelize(1 to 1,3000).map(i => Row.fromSeq(Array.fill(100)(i)))
val schema = StructType(for(i <- 1 to 100) yield {
  StructField("COL"+i, IntegerType, true)
})
val rdd2 = rdd.mapPartitionsWithIndex((idx,iter) => if(idx==0 || idx==1) Iterator[Row]() else iter)
val df2 = sqlContext.createDataFrame(rdd2,schema)
df2.rdd.take(1000) // OK
df2.take(1000) // OOME
{code}
I think this issue comes from a difference in how rows are taken from an RDD versus a DataFrame: the RDD takes rows with its internal incremental method, while the DataFrame takes rows through a SQL Limit plan. The odd part is that the DataFrame scans all the partitions when the first partition is empty. Maybe it is related to SPARK-3211.
> Take rows from DataFrame with empty first partition
> ---
>
> Key: SPARK-19623
> URL: https://issues.apache.org/jira/browse/SPARK-19623
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Jaeboo Jung
>Priority: Minor
>
> I use Spark 1.6.2 with 1 master and 6 workers. Assuming we have a DataFrame with an empty first partition, the DataFrame and its RDD behave differently when taking rows. If we take only 1000 rows from the DataFrame, it causes an OOME, but the RDD is OK.
> In detail,
> DataFrame without an empty first partition => OK
> DataFrame with an empty first partition => OOME
> RDD of DataFrame with an empty first partition => OK
> The code below reproduces this error.
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> val rdd = sc.parallelize(1 to 1,1000).map(i => Row.fromSeq(Array.fill(100)(i)))
> val schema = StructType(for(i <- 1 to 100) yield {
>   StructField("COL"+i, IntegerType, true)
> })
> val rdd2 = rdd.mapPartitionsWithIndex((idx,iter) => if(idx==0 || idx==1) Iterator[Row]() else iter)
> val df1 = sqlContext.createDataFrame(rdd,schema)
> df1.take(1000) // OK
> val df2 = sqlContext.createDataFrame(rdd2,schema)
> df2.rdd.take(1000) // OK
> df2.take(1000) // OOME
> {code}
> I tested it on Spark 1.6.2 with 2gb of driver memory and 5gb of executor memory.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19639) Add spark.svmLinear example and update vignettes
[ https://issues.apache.org/jira/browse/SPARK-19639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19639: Assignee: Apache Spark > Add spark.svmLinear example and update vignettes > > > Key: SPARK-19639 > URL: https://issues.apache.org/jira/browse/SPARK-19639 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Miao Wang >Assignee: Apache Spark > > We recently add the spark.svmLinear API for SparkR. We need to add an example > and update the vignettes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19639) Add spark.svmLinear example and update vignettes
[ https://issues.apache.org/jira/browse/SPARK-19639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19639: Assignee: (was: Apache Spark) > Add spark.svmLinear example and update vignettes > > > Key: SPARK-19639 > URL: https://issues.apache.org/jira/browse/SPARK-19639 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Miao Wang > > We recently add the spark.svmLinear API for SparkR. We need to add an example > and update the vignettes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19639) Add spark.svmLinear example and update vignettes
[ https://issues.apache.org/jira/browse/SPARK-19639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870918#comment-15870918 ] Apache Spark commented on SPARK-19639: -- User 'wangmiao1981' has created a pull request for this issue: https://github.com/apache/spark/pull/16969 > Add spark.svmLinear example and update vignettes > > > Key: SPARK-19639 > URL: https://issues.apache.org/jira/browse/SPARK-19639 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Miao Wang > > We recently add the spark.svmLinear API for SparkR. We need to add an example > and update the vignettes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19639) Add spark.svmLinear example and update vignettes
Miao Wang created SPARK-19639: - Summary: Add spark.svmLinear example and update vignettes Key: SPARK-19639 URL: https://issues.apache.org/jira/browse/SPARK-19639 Project: Spark Issue Type: Documentation Components: SparkR Affects Versions: 2.2.0 Reporter: Miao Wang We recently added the spark.svmLinear API for SparkR. We need to add an example and update the vignettes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19638) Filter pushdown not working for struct fields
Nick Dimiduk created SPARK-19638: Summary: Filter pushdown not working for struct fields Key: SPARK-19638 URL: https://issues.apache.org/jira/browse/SPARK-19638 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Nick Dimiduk Working with a dataset containing struct fields, and with debug logging enabled in the ES connector, I'm seeing the following behavior. The dataframe is created over the ES connector and the schema is then extended with a couple of column aliases, such as:
{noformat}
df.withColumn("f2", df("foo"))
{noformat}
Queries against those aliased columns work as expected for fields that are non-struct members.
{noformat}
scala> df.withColumn("f2", df("foo")).where("f2 == '1'").limit(0).show
17/02/16 15:06:49 DEBUG DataSource: Pushing down filters [IsNotNull(foo),EqualTo(foo,1)]
17/02/16 15:06:49 TRACE DataSource: Transformed filters into DSL [{"exists":{"field":"foo"}},{"match":{"foo":"1"}}]
{noformat}
However, try the same with an alias over a struct field, and no filters are pushed down.
{noformat}
scala> df.withColumn("bar_baz", df("bar.baz")).where("bar_baz == '1'").limit(1).show
{noformat}
In fact, this is the case even when no alias is used at all.
{noformat}
scala> df.where("bar.baz == '1'").limit(1).show
{noformat}
Basically, pushdown for structs doesn't work at all. Maybe this is specific to the ES connector? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
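The reported behavior is what one would see if predicate translation only handles top-level attributes. A toy sketch of such a translation step (hypothetical types and names, for illustration only, not Spark's or the ES connector's actual code):

```scala
// Toy model of translating a predicate into a data-source filter:
// only equality on a plain top-level attribute is translated; anything
// involving a struct-field access is dropped (i.e. not pushed down).
object PushdownModel {
  sealed trait Expr
  case class Attr(name: String) extends Expr                   // top-level column
  case class GetField(child: Expr, field: String) extends Expr // struct access
  case class EqualTo(left: Expr, right: String) extends Expr

  def translate(e: Expr): Option[String] = e match {
    case EqualTo(Attr(n), v) => Some(s"$n = '$v'")
    case _                   => None // nested fields fall through: no pushdown
  }

  def main(args: Array[String]): Unit = {
    println(translate(EqualTo(Attr("foo"), "1")))                  // prints Some(foo = '1')
    println(translate(EqualTo(GetField(Attr("bar"), "baz"), "1"))) // prints None
  }
}
```

Under this model, "foo == '1'" produces a pushable filter while "bar.baz == '1'" produces none, matching the debug logs in the report.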
[jira] [Created] (SPARK-19637) add to_json APIs to SQL
Burak Yavuz created SPARK-19637: --- Summary: add to_json APIs to SQL Key: SPARK-19637 URL: https://issues.apache.org/jira/browse/SPARK-19637 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.1.0 Reporter: Burak Yavuz The "to_json" function is useful for turning a struct into a JSON string. It currently doesn't work in SQL, but adding it should be trivial. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
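What to_json does to a struct can be illustrated with a minimal plain-Scala sketch (an over-simplified stand-in with hypothetical names; it handles only flat string/number fields, unlike the real function):

```scala
// Minimal illustration of struct-to-JSON serialization: render
// (field name, value) pairs as a JSON object string.
object ToJsonSketch {
  def toJson(struct: Seq[(String, Any)]): String =
    struct.map {
      case (k, v: String) => s"\"$k\":\"$v\"" // strings are quoted
      case (k, v)         => s"\"$k\":$v"     // numbers etc. are rendered bare
    }.mkString("{", ",", "}")

  def main(args: Array[String]): Unit = {
    println(toJson(Seq("a" -> 1, "b" -> "x"))) // prints {"a":1,"b":"x"}
  }
}
```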
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870825#comment-15870825 ] Miao Wang commented on SPARK-19634: --- I can give it a try. Thanks! Miao > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14658) when executor lost DagScheduer may submit one stage twice even if the first running taskset for this stage is not finished
[ https://issues.apache.org/jira/browse/SPARK-14658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870816#comment-15870816 ] Josh Rosen commented on SPARK-14658: Here are the logs from my reproduction, excerpted down to only the relevant parts (as near as I can tell): First attempt of the task set being submitted:
{code}
17/02/13 20:11:59 INFO DAGScheduler: waiting: Set(ShuffleMapStage 3086, ResultStage 3087)
17/02/13 20:11:59 INFO DAGScheduler: failed: Set()
17/02/13 20:11:59 INFO DAGScheduler: Submitting ShuffleMapStage 3086 (MapPartitionsRDD[34696] at cache at :61), which has no missing parents
17/02/13 20:11:59 INFO MemoryStore: Block broadcast_2871 stored as values in memory (estimated size 67.1 KB, free 9.1 GB)
17/02/13 20:11:59 INFO MemoryStore: Block broadcast_2871_piece0 stored as bytes in memory (estimated size 28.1 KB, free 9.1 GB)
17/02/13 20:11:59 INFO BlockManagerInfo: Added broadcast_2871_piece0 in memory on :45333 (size: 28.1 KB, free: 10.6 GB)
17/02/13 20:11:59 INFO SparkContext: Created broadcast 2871 from broadcast at DAGScheduler.scala:996
17/02/13 20:11:59 INFO DAGScheduler: Submitting 2213 missing tasks from ShuffleMapStage 3086 (MapPartitionsRDD[34696] at cache at :61)
17/02/13 20:11:59 INFO TaskSchedulerImpl: Adding task set 3086.0 with 2213 tasks
17/02/13 20:11:59 INFO FairSchedulableBuilder: Added task set TaskSet_3086.0 tasks to pool 1969095006217179029
{code}
While the task set was running, some tasks failed due to fetch failures from the parent stage, causing both the stage with the fetch failures and the parent stage to be resubmitted:
{code}
17/02/13 20:44:34 WARN TaskSetManager: Lost task 1213.0 in stage 3086.0 (TID 370751, , executor 622): FetchFailed(null, shuffleId=638, mapId=-1, reduceId=0, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 638
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:693)
    at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:147)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
    at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at
[jira] [Assigned] (SPARK-19337) Documentation and examples for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-19337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19337: Assignee: Apache Spark > Documentation and examples for LinearSVC > > > Key: SPARK-19337 > URL: https://issues.apache.org/jira/browse/SPARK-19337 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > User guide + example code for LinearSVC -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19337) Documentation and examples for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-19337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870813#comment-15870813 ] Apache Spark commented on SPARK-19337: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/16968 > Documentation and examples for LinearSVC > > > Key: SPARK-19337 > URL: https://issues.apache.org/jira/browse/SPARK-19337 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley > > User guide + example code for LinearSVC -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19337) Documentation and examples for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-19337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19337: Assignee: (was: Apache Spark) > Documentation and examples for LinearSVC > > > Key: SPARK-19337 > URL: https://issues.apache.org/jira/browse/SPARK-19337 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley > > User guide + example code for LinearSVC -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18409) LSH approxNearestNeighbors should use approxQuantile instead of sort
[ https://issues.apache.org/jira/browse/SPARK-18409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18409: Assignee: Apache Spark > LSH approxNearestNeighbors should use approxQuantile instead of sort > > > Key: SPARK-18409 > URL: https://issues.apache.org/jira/browse/SPARK-18409 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > LSHModel.approxNearestNeighbors sorts the full dataset on the hashDistance in > order to find a threshold. It should use approxQuantile instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18409) LSH approxNearestNeighbors should use approxQuantile instead of sort
[ https://issues.apache.org/jira/browse/SPARK-18409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870791#comment-15870791 ] Apache Spark commented on SPARK-18409: -- User 'Yunni' has created a pull request for this issue: https://github.com/apache/spark/pull/16966 > LSH approxNearestNeighbors should use approxQuantile instead of sort > > > Key: SPARK-18409 > URL: https://issues.apache.org/jira/browse/SPARK-18409 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > LSHModel.approxNearestNeighbors sorts the full dataset on the hashDistance in > order to find a threshold. It should use approxQuantile instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18409) LSH approxNearestNeighbors should use approxQuantile instead of sort
[ https://issues.apache.org/jira/browse/SPARK-18409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18409: Assignee: (was: Apache Spark) > LSH approxNearestNeighbors should use approxQuantile instead of sort > > > Key: SPARK-18409 > URL: https://issues.apache.org/jira/browse/SPARK-18409 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > LSHModel.approxNearestNeighbors sorts the full dataset on the hashDistance in > order to find a threshold. It should use approxQuantile instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
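Editor's note: the gain SPARK-18409 is after can be sketched in plain Python, independent of Spark's API. Instead of sorting the full distance column to find the k-th nearest threshold, sort only a random sample and read off the corresponding quantile (function and parameter names here are illustrative, not Spark's):

```python
import random

def approx_quantile(values, q, sample_size=1000, seed=0):
    """Approximate the q-quantile by sorting a random sample instead of
    the full dataset -- the idea behind replacing the global sort on
    hashDistance with approxQuantile in approxNearestNeighbors."""
    rng = random.Random(seed)
    sample = values if len(values) <= sample_size else rng.sample(values, sample_size)
    s = sorted(sample)
    idx = min(len(s) - 1, int(q * len(s)))
    return s[idx]

# Finding the threshold for the ~10% nearest of 100,000 points sorts
# only 2,000 sampled values rather than all 100,000.
threshold = approx_quantile(list(range(100000)), 0.1, sample_size=2000, seed=1)
```

The estimate is within sampling error of the exact quantile, which is acceptable for an *approximate* nearest-neighbor method and avoids a full shuffle-and-sort of the dataset.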
[jira] [Resolved] (SPARK-18286) Add Scala/Java/Python examples for MinHash and RandomProjection
[ https://issues.apache.org/jira/browse/SPARK-18286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yun Ni resolved SPARK-18286. Resolution: Fixed Fix Version/s: 2.2.0 > Add Scala/Java/Python examples for MinHash and RandomProjection > --- > > Key: SPARK-18286 > URL: https://issues.apache.org/jira/browse/SPARK-18286 > Project: Spark > Issue Type: Improvement > Components: Examples, ML >Reporter: Yanbo Liang >Priority: Minor > Fix For: 2.2.0 > > > Add Scala/Java/Python examples for MinHash and RandomProjection -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19553) Add GroupedData.countApprox()
[ https://issues.apache.org/jira/browse/SPARK-19553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870780#comment-15870780 ] Nicholas Chammas commented on SPARK-19553: -- The utility of 1) would be being able to count items instead of distinct items, unless I misunderstood what you're saying. I would imagine that just counting items (as opposed to distinct items) would be cheaper, in addition to being semantically different. I'll open a PR for 3), unless someone else wants to step in and do that. > Add GroupedData.countApprox() > - > > Key: SPARK-19553 > URL: https://issues.apache.org/jira/browse/SPARK-19553 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > We already have a > [{{pyspark.sql.functions.approx_count_distinct()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.approx_count_distinct] > that can be applied to grouped data, but it seems odd that you can't just > get regular approximate count for grouped data. > I imagine the API would mirror that for > [{{RDD.countApprox()}}|http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.countApprox], > but I'm not sure: > {code} > (df > .groupBy('col1') > .countApprox(timeout=300, confidence=0.95) > .show()) > {code} > Or, if we want to mirror the {{approx_count_distinct()}} function, we can do > that too. I'd want to understand why that function doesn't take a timeout or > confidence parameter, though. Also, what does {{rsd}} mean? It's not > documented. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
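Editor's note: the distinction drawn in the comment above, counting items per group versus counting *distinct* items per group, can be shown in plain Python (this is the semantics only, not Spark code):

```python
from collections import Counter, defaultdict

rows = [("a", 1), ("a", 1), ("a", 2), ("b", 3)]

# Per-group row count: what a hypothetical GroupedData.countApprox()
# would estimate. Duplicates count every time they appear.
counts = Counter(key for key, _ in rows)

# Per-group distinct count: what approx_count_distinct() estimates.
# Duplicate values collapse into one.
distinct = defaultdict(set)
for key, value in rows:
    distinct[key].add(value)
distinct_counts = {k: len(v) for k, v in distinct.items()}

# counts == {"a": 3, "b": 1}; distinct_counts == {"a": 2, "b": 1}
```

The two are semantically different, and as the comment notes, the plain count is also cheaper: it needs no per-group value tracking (here a set; in Spark a HyperLogLog sketch), only a running tally.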
[jira] [Updated] (SPARK-14658) when an executor is lost, the DAGScheduler may submit one stage twice even if the first running taskset for the stage is not finished
[ https://issues.apache.org/jira/browse/SPARK-14658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-14658: --- Description: {code} 16/04/14 15:35:22 ERROR DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting down SparkContext java.lang.IllegalStateException: more than one active taskSet for stage 57: 57.2,57.1 at org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173) at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) {code} First Time: {code} 16/04/14 15:35:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at AccessController.java:-2) because some of its tasks had failed: 5, 8, 9, 12, 13, 16, 17, 18, 19, 23, 26, 27, 28, 29, 30, 31, 40, 42, 43, 48, 49, 50, 51, 52, 53, 55, 56, 57, 59, 60, 61, 67, 70, 71, 84, 85, 86, 87, 98, 99, 100, 101, 108, 109, 110, 111, 112, 113, 114, 115, 126, 127, 134, 136, 137, 146, 147, 150, 151, 154, 155, 158, 159, 162, 163, 164, 165, 166, 167, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 188, 189, 190, 191, 198, 199, 204, 206, 207, 208, 218, 219, 222, 223, 230, 231, 236, 238, 239 16/04/14 15:35:20 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) 16/04/14 15:35:20 DEBUG DAGScheduler: missing: List() 16/04/14 15:35:20 INFO DAGScheduler: Submitting ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2), which 
has no missing parents 16/04/14 15:35:20 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) 16/04/14 15:35:20 INFO DAGScheduler: Submitting 100 missing tasks from ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) 16/04/14 15:35:20 DEBUG DAGScheduler: New pending partitions: Set(206, 177, 127, 98, 48, 27, 23, 163, 238, 188, 159, 28, 109, 59, 9, 176, 126, 207, 174, 43, 170, 208, 158, 108, 29, 8, 204, 154, 223, 173, 219, 190, 111, 61, 40, 136, 115, 86, 57, 155, 55, 230, 222, 180, 172, 151, 101, 18, 166, 56, 137, 87, 52, 171, 71, 42, 167, 198, 67, 17, 236, 165, 13, 5, 53, 178, 99, 70, 49, 218, 147, 164, 114, 85, 60, 31, 179, 150, 19, 100, 50, 175, 146, 134, 113, 84, 51, 30, 199, 26, 16, 191, 162, 112, 12, 239, 231, 189, 181, 110) {code} Second Time: {code} 16/04/14 15:35:22 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at AccessController.java:-2) because some of its tasks had failed: 26 16/04/14 15:35:22 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) 16/04/14 15:35:22 DEBUG DAGScheduler: missing: List() 16/04/14 15:35:22 INFO DAGScheduler: Submitting ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no missing parents 16/04/14 15:35:22 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) 16/04/14 15:35:22 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) 16/04/14 15:35:22 DEBUG DAGScheduler: New pending partitions: Set(26) {code} was: 16/04/14 15:35:22 ERROR DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting down SparkContext java.lang.IllegalStateException: more than one active taskSet for stage 57: 57.2,57.1 at org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173) at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052) at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) First Time: 16/04/14 15:35:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at AccessController.java:-2) because some of its tasks had failed: 5, 8, 9, 12, 13, 16, 17, 18, 19, 23, 26, 27, 28, 29, 30, 31, 40, 42, 43, 48, 49, 50, 51, 52, 53, 55, 56, 57, 59, 60, 61, 67, 70, 71, 84, 85, 86, 87, 98, 99,
[jira] [Commented] (SPARK-14658) when an executor is lost, the DAGScheduler may submit one stage twice even if the first running taskset for the stage is not finished
[ https://issues.apache.org/jira/browse/SPARK-14658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870777#comment-15870777 ] Josh Rosen commented on SPARK-14658: [~srowen], I think that [~yixiaohua] is right here: it looks like SPARK-14649 is proposing a performance optimization to avoid the wasteful resubmission of tasks when failures occur (which is only a perf. optimization, although an important one for the workload that [~sitalke...@gmail.com] describes), whereas this ticket is discussing a correctness issue where one of the scheduler's internal invariants is being violated, causing a total SparkContext shutdown. I'm going to unmark this as a duplicate and will re-open so that someone can review the proposed fix. I also have additional logs from a fresh occurrence of this bug in Spark 2.1.0+, which I'll upload in a followup comment here. > when executor lost DagScheduer may submit one stage twice even if the first > running taskset for this stage is not finished > -- > > Key: SPARK-14658 > URL: https://issues.apache.org/jira/browse/SPARK-14658 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.1, 2.0.0, 2.1.0, 2.2.0 > Environment: spark1.6.1 hadoop-2.6.0-cdh5.4.2 >Reporter: yixiaohua > > 16/04/14 15:35:22 ERROR DAGSchedulerEventProcessLoop: > DAGSchedulerEventProcessLoop failed; shutting down SparkContext > java.lang.IllegalStateException: more than one active taskSet for stage 57: > 57.2,57.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > First Time: > 16/04/14 15:35:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 5, 8, 9, 12, > 13, 16, 17, 18, 19, 23, 26, 27, 28, 29, 30, 31, 40, 42, 43, 48, 49, 50, 51, > 52, 53, 55, 56, 57, 59, 60, 61, 67, 70, 71, 84, 85, 86, 87, 98, 99, 100, 101, > 108, 109, 110, 111, 112, 113, 114, 115, 126, 127, 134, 136, 137, 146, 147, > 150, 151, 154, 155, 158, 159, 162, 163, 164, 165, 166, 167, 170, 171, 172, > 173, 174, 175, 176, 177, 178, 179, 180, 181, 188, 189, 190, 191, 198, 199, > 204, 206, 207, 208, 218, 219, 222, 223, 230, 231, 236, 238, 239 > 16/04/14 15:35:20 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:20 DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:20 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:20 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:20 INFO DAGScheduler: Submitting 100 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:20 DEBUG DAGScheduler: New pending partitions: Set(206, 177, > 127, 98, 48, 27, 23, 163, 238, 188, 159, 28, 109, 59, 9, 176, 126, 207, 174, > 43, 170, 208, 158, 108, 29, 8, 204, 154, 223, 173, 219, 190, 111, 61, 40, > 136, 115, 86, 57, 155, 55, 230, 222, 180, 172, 151, 101, 18, 166, 56, 137, > 87, 52, 171, 71, 42, 167, 198, 67, 17, 236, 165, 13, 5, 53, 178, 99, 70, 49, > 218, 147, 164, 114, 85, 60, 31, 179, 150, 19, 100, 50, 175, 146, 134, 113, > 84, 51, 30, 199, 26, 16, 191, 162, 112, 12, 239, 231, 189, 181, 110) > Second Time: > 16/04/14 15:35:22 INFO DAGScheduler: 
Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 26 > 16/04/14 15:35:22 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:22 DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:22 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:22 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:22 INFO DAGScheduler: Submitting 1 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:22
[jira] [Updated] (SPARK-14658) when an executor is lost, the DAGScheduler may submit one stage twice even if the first running taskset for the stage is not finished
[ https://issues.apache.org/jira/browse/SPARK-14658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-14658: --- Affects Version/s: 2.2.0 2.0.0 2.1.0 > when executor lost DagScheduer may submit one stage twice even if the first > running taskset for this stage is not finished > -- > > Key: SPARK-14658 > URL: https://issues.apache.org/jira/browse/SPARK-14658 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.1, 2.0.0, 2.1.0, 2.2.0 > Environment: spark1.6.1 hadoop-2.6.0-cdh5.4.2 >Reporter: yixiaohua > > 16/04/14 15:35:22 ERROR DAGSchedulerEventProcessLoop: > DAGSchedulerEventProcessLoop failed; shutting down SparkContext > java.lang.IllegalStateException: more than one active taskSet for stage 57: > 57.2,57.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > First Time: > 16/04/14 15:35:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 5, 8, 9, 12, > 13, 16, 17, 18, 19, 23, 26, 27, 28, 29, 30, 31, 40, 42, 43, 48, 49, 50, 51, > 52, 53, 55, 56, 57, 59, 60, 61, 67, 70, 71, 84, 85, 86, 87, 98, 99, 100, 101, > 108, 109, 110, 111, 112, 113, 114, 115, 126, 127, 134, 136, 137, 146, 147, > 150, 151, 154, 155, 158, 159, 
162, 163, 164, 165, 166, 167, 170, 171, 172, > 173, 174, 175, 176, 177, 178, 179, 180, 181, 188, 189, 190, 191, 198, 199, > 204, 206, 207, 208, 218, 219, 222, 223, 230, 231, 236, 238, 239 > 16/04/14 15:35:20 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:20 DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:20 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:20 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:20 INFO DAGScheduler: Submitting 100 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:20 DEBUG DAGScheduler: New pending partitions: Set(206, 177, > 127, 98, 48, 27, 23, 163, 238, 188, 159, 28, 109, 59, 9, 176, 126, 207, 174, > 43, 170, 208, 158, 108, 29, 8, 204, 154, 223, 173, 219, 190, 111, 61, 40, > 136, 115, 86, 57, 155, 55, 230, 222, 180, 172, 151, 101, 18, 166, 56, 137, > 87, 52, 171, 71, 42, 167, 198, 67, 17, 236, 165, 13, 5, 53, 178, 99, 70, 49, > 218, 147, 164, 114, 85, 60, 31, 179, 150, 19, 100, 50, 175, 146, 134, 113, > 84, 51, 30, 199, 26, 16, 191, 162, 112, 12, 239, 231, 189, 181, 110) > Second Time: > 16/04/14 15:35:22 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 26 > 16/04/14 15:35:22 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:22 DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:22 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:22 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:22 INFO DAGScheduler: Submitting 1 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:22 DEBUG DAGScheduler: New pending partitions: 
Set(26)
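Editor's note: the invariant the stack trace shows being violated ("more than one active taskSet for stage 57: 57.2,57.1") can be sketched as a toy check. Names here are hypothetical; Spark's TaskSchedulerImpl is far more involved:

```python
class ToyTaskScheduler:
    """Mimics the invariant enforced in TaskSchedulerImpl.submitTasks:
    at most one TaskSet attempt may be active per stage at a time."""

    def __init__(self):
        self.active = {}  # stage_id -> list of active attempt numbers

    def submit_tasks(self, stage_id, attempt):
        attempts = self.active.setdefault(stage_id, [])
        if attempts:
            # A previous attempt for this stage is still running: this is
            # the condition the bug report hits, crashing the SparkContext.
            raise RuntimeError(
                "more than one active taskSet for stage %d: %s" % (
                    stage_id,
                    ",".join("%d.%d" % (stage_id, a)
                             for a in attempts + [attempt])))
        attempts.append(attempt)

    def taskset_finished(self, stage_id, attempt):
        self.active[stage_id].remove(attempt)

sched = ToyTaskScheduler()
sched.submit_tasks(57, 1)  # the first resubmission is accepted
# Submitting attempt 2 before attempt 1 finishes would raise, which is
# the toy analogue of the IllegalStateException in the logs above.
```

The bug, then, is not in this check but in the DAGScheduler resubmitting the stage (attempt 57.2) while attempt 57.1 had not yet been marked finished.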
[jira] [Updated] (SPARK-14658) when an executor is lost, the DAGScheduler may submit one stage twice even if the first running taskset for the stage is not finished
[ https://issues.apache.org/jira/browse/SPARK-14658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-14658: --- Component/s: (was: Spark Core) Scheduler > when executor lost DagScheduer may submit one stage twice even if the first > running taskset for this stage is not finished > -- > > Key: SPARK-14658 > URL: https://issues.apache.org/jira/browse/SPARK-14658 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.1, 2.0.0, 2.1.0, 2.2.0 > Environment: spark1.6.1 hadoop-2.6.0-cdh5.4.2 >Reporter: yixiaohua > > 16/04/14 15:35:22 ERROR DAGSchedulerEventProcessLoop: > DAGSchedulerEventProcessLoop failed; shutting down SparkContext > java.lang.IllegalStateException: more than one active taskSet for stage 57: > 57.2,57.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > First Time: > 16/04/14 15:35:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 5, 8, 9, 12, > 13, 16, 17, 18, 19, 23, 26, 27, 28, 29, 30, 31, 40, 42, 43, 48, 49, 50, 51, > 52, 53, 55, 56, 57, 59, 60, 61, 67, 70, 71, 84, 85, 86, 87, 98, 99, 100, 101, > 108, 109, 110, 111, 112, 113, 114, 115, 126, 127, 134, 136, 137, 146, 147, > 150, 151, 154, 155, 158, 159, 
162, 163, 164, 165, 166, 167, 170, 171, 172, > 173, 174, 175, 176, 177, 178, 179, 180, 181, 188, 189, 190, 191, 198, 199, > 204, 206, 207, 208, 218, 219, 222, 223, 230, 231, 236, 238, 239 > 16/04/14 15:35:20 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:20 DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:20 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:20 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:20 INFO DAGScheduler: Submitting 100 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:20 DEBUG DAGScheduler: New pending partitions: Set(206, 177, > 127, 98, 48, 27, 23, 163, 238, 188, 159, 28, 109, 59, 9, 176, 126, 207, 174, > 43, 170, 208, 158, 108, 29, 8, 204, 154, 223, 173, 219, 190, 111, 61, 40, > 136, 115, 86, 57, 155, 55, 230, 222, 180, 172, 151, 101, 18, 166, 56, 137, > 87, 52, 171, 71, 42, 167, 198, 67, 17, 236, 165, 13, 5, 53, 178, 99, 70, 49, > 218, 147, 164, 114, 85, 60, 31, 179, 150, 19, 100, 50, 175, 146, 134, 113, > 84, 51, 30, 199, 26, 16, 191, 162, 112, 12, 239, 231, 189, 181, 110) > Second Time: > 16/04/14 15:35:22 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 26 > 16/04/14 15:35:22 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:22 DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:22 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:22 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:22 INFO DAGScheduler: Submitting 1 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:22 DEBUG DAGScheduler: New pending partitions: 
Set(26)
[jira] [Reopened] (SPARK-14658) when an executor is lost, the DAGScheduler may submit one stage twice even if the first running taskset for the stage is not finished
[ https://issues.apache.org/jira/browse/SPARK-14658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reopened SPARK-14658: > when executor lost DagScheduer may submit one stage twice even if the first > running taskset for this stage is not finished > -- > > Key: SPARK-14658 > URL: https://issues.apache.org/jira/browse/SPARK-14658 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.1, 2.0.0, 2.1.0, 2.2.0 > Environment: spark1.6.1 hadoop-2.6.0-cdh5.4.2 >Reporter: yixiaohua > > 16/04/14 15:35:22 ERROR DAGSchedulerEventProcessLoop: > DAGSchedulerEventProcessLoop failed; shutting down SparkContext > java.lang.IllegalStateException: more than one active taskSet for stage 57: > 57.2,57.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > First Time: > 16/04/14 15:35:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 5, 8, 9, 12, > 13, 16, 17, 18, 19, 23, 26, 27, 28, 29, 30, 31, 40, 42, 43, 48, 49, 50, 51, > 52, 53, 55, 56, 57, 59, 60, 61, 67, 70, 71, 84, 85, 86, 87, 98, 99, 100, 101, > 108, 109, 110, 111, 112, 113, 114, 115, 126, 127, 134, 136, 137, 146, 147, > 150, 151, 154, 155, 158, 159, 162, 163, 164, 165, 166, 167, 170, 171, 
172, > 173, 174, 175, 176, 177, 178, 179, 180, 181, 188, 189, 190, 191, 198, 199, > 204, 206, 207, 208, 218, 219, 222, 223, 230, 231, 236, 238, 239 > 16/04/14 15:35:20 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:20 DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:20 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:20 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:20 INFO DAGScheduler: Submitting 100 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:20 DEBUG DAGScheduler: New pending partitions: Set(206, 177, > 127, 98, 48, 27, 23, 163, 238, 188, 159, 28, 109, 59, 9, 176, 126, 207, 174, > 43, 170, 208, 158, 108, 29, 8, 204, 154, 223, 173, 219, 190, 111, 61, 40, > 136, 115, 86, 57, 155, 55, 230, 222, 180, 172, 151, 101, 18, 166, 56, 137, > 87, 52, 171, 71, 42, 167, 198, 67, 17, 236, 165, 13, 5, 53, 178, 99, 70, 49, > 218, 147, 164, 114, 85, 60, 31, 179, 150, 19, 100, 50, 175, 146, 134, 113, > 84, 51, 30, 199, 26, 16, 191, 162, 112, 12, 239, 231, 189, 181, 110) > Second Time: > 16/04/14 15:35:22 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 26 > 16/04/14 15:35:22 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:22 DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:22 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:22 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:22 INFO DAGScheduler: Submitting 1 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:22 DEBUG DAGScheduler: New pending partitions: Set(26) -- This message was sent by 
Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19628) Duplicate Spark jobs in 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-19628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19628: -- Fix Version/s: (was: 2.0.1) > Duplicate Spark jobs in 2.1.0 > - > > Key: SPARK-19628 > URL: https://issues.apache.org/jira/browse/SPARK-19628 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jork Zijlstra > Attachments: spark2.0.1.png, spark2.1.0-examplecode.png, > spark2.1.0.png > > After upgrading to Spark 2.1.0 we noticed that duplicate jobs are > executed. Going back to Spark 2.0.1, they are gone again. > {code} > import org.apache.spark.sql._ > object DoubleJobs { > def main(args: Array[String]) { > System.setProperty("hadoop.home.dir", "/tmp"); > val sparkSession: SparkSession = SparkSession.builder > .master("local[4]") > .appName("spark session example") > .config("spark.driver.maxResultSize", "6G") > .config("spark.sql.orc.filterPushdown", true) > .config("spark.sql.hive.metastorePartitionPruning", true) > .getOrCreate() > sparkSession.sqlContext.setConf("spark.sql.orc.filterPushdown", "true") > val paths = Seq( > ""//some orc source > ) > def dataFrame(path: String): DataFrame = { > sparkSession.read.orc(path) > } > paths.foreach(path => { > dataFrame(path).show(20) > }) > } > } > {code}
[jira] [Assigned] (SPARK-18450) Add AND-amplification to Locality Sensitive Hashing
[ https://issues.apache.org/jira/browse/SPARK-18450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18450: Assignee: (was: Apache Spark) > Add AND-amplification to Locality Sensitive Hashing > --- > > Key: SPARK-18450 > URL: https://issues.apache.org/jira/browse/SPARK-18450 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yun Ni > > We are changing the LSH transform API from {{Vector}} to {{Array of Vector}}. > Once the change is applied, we can add AND-amplification to LSH > implementation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18450) Add AND-amplification to Locality Sensitive Hashing
[ https://issues.apache.org/jira/browse/SPARK-18450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870727#comment-15870727 ] Apache Spark commented on SPARK-18450: -- User 'Yunni' has created a pull request for this issue: https://github.com/apache/spark/pull/16965 > Add AND-amplification to Locality Sensitive Hashing > --- > > Key: SPARK-18450 > URL: https://issues.apache.org/jira/browse/SPARK-18450 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yun Ni > > We are changing the LSH transform API from {{Vector}} to {{Array of Vector}}. > Once the change is applied, we can add AND-amplification to LSH > implementation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
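Editor's note: AND-amplification itself is standard LSH machinery, which is why it needs the Array-of-Vector output described above. A minimal sketch, independent of Spark's API (random-projection sign hashes; all names illustrative):

```python
import random

def make_hashes(num_bands, rows_per_band, dim, seed=0):
    """Random hyperplanes grouped into bands: the hashes inside one band
    are ANDed together, and the bands are ORed across."""
    rng = random.Random(seed)
    return [[[rng.gauss(0, 1) for _ in range(dim)]
             for _ in range(rows_per_band)]
            for _ in range(num_bands)]

def signature(bands, x):
    # One tuple of sign bits per band -- the "Array of Vector" shape.
    return [tuple(int(sum(w * xi for w, xi in zip(h, x)) >= 0) for h in band)
            for band in bands]

def is_candidate(bands, x, y):
    """x and y are candidate neighbors if, in at least one band (OR),
    every hash in that band agrees (AND-amplification)."""
    sx, sy = signature(bands, x), signature(bands, y)
    return any(bx == by for bx, by in zip(sx, sy))
```

Raising `rows_per_band` makes each band stricter (fewer false positives); raising `num_bands` compensates on recall. With a single hash per output value, as in the pre-change API, only the OR side is expressible.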
[jira] [Assigned] (SPARK-19163) Lazy creation of the _judf
[ https://issues.apache.org/jira/browse/SPARK-19163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-19163: - Assignee: Maciej Szymkiewicz > Lazy creation of the _judf > -- > > Key: SPARK-19163 > URL: https://issues.apache.org/jira/browse/SPARK-19163 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.1 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz > Fix For: 2.2.0 > > > Current state > Right {{UserDefinedFunction}} eagerly creates {{_judf}} and initializes > {{SparkSession}} > (https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L1832) > as a side effect. This behavior may have undesired results when {{udf}} is > imported from a module: > {{myudfs.py}} > {code} > from pyspark.sql.functions import udf > from pyspark.sql.types import IntegerType > > def _add_one(x): > """Adds one""" > if x is not None: > return x + 1 > > add_one = udf(_add_one, IntegerType()) > {code} > > > Example session: > {code} > In [1]: from pyspark.sql import SparkSession > In [2]: from myudfs import add_one > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/01/07 19:55:44 WARN Utils: Your hostname, xxx resolves to a loopback > address: 127.0.1.1; using xxx instead (on interface eth0) > 17/01/07 19:55:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > In [3]: spark = SparkSession.builder.appName("foo").getOrCreate() > In [4]: spark.sparkContext.appName > Out[4]: 'pyspark-shell' > {code} > Proposed > Delay {{_judf}} initialization until the first call. > {code} > In [1]: from pyspark.sql import SparkSession > In [2]: from myudfs import add_one > In [3]: spark = SparkSession.builder.appName("foo").getOrCreate() > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". 
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/01/07 19:58:38 WARN Utils: Your hostname, xxx resolves to a loopback > address: 127.0.1.1; using xxx instead (on interface eth0) > 17/01/07 19:58:38 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > In [4]: spark.sparkContext.appName > Out[4]: 'foo' > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
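The lazy-initialization pattern proposed in this ticket can be sketched outside of Spark in plain Python. `FakeSession` below is a hypothetical stand-in for `SparkSession`; in PySpark the actual fix is to have `UserDefinedFunction` build `_judf` on first use instead of in its constructor.

```python
# Minimal sketch of the proposed fix: defer expensive session-dependent state
# until the wrapped function is actually called, so importing a module full of
# UDFs no longer creates a session as a side effect.

class FakeSession:
    created = 0  # counts how many "sessions" have been initialized

    def __init__(self):
        FakeSession.created += 1


class LazyUDF:
    def __init__(self, func):
        self.func = func
        self._judf = None  # not created yet: import time is side-effect free

    @property
    def judf(self):
        # Initialize on first access, mirroring the proposed _judf behavior.
        if self._judf is None:
            self._judf = FakeSession()
        return self._judf

    def __call__(self, x):
        self.judf  # session state is only touched when the UDF is used
        return self.func(x)


add_one = LazyUDF(lambda x: x + 1)  # "import time": no session created
assert FakeSession.created == 0
print(add_one(41))                  # first call triggers initialization; prints 42
assert FakeSession.created == 1
```

With this shape, the user-controlled `getOrCreate()` in the example session runs before any UDF call, so the application name "foo" wins.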
[jira] [Assigned] (SPARK-18450) Add AND-amplification to Locality Sensitive Hashing
[ https://issues.apache.org/jira/browse/SPARK-18450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18450: Assignee: Apache Spark > Add AND-amplification to Locality Sensitive Hashing > --- > > Key: SPARK-18450 > URL: https://issues.apache.org/jira/browse/SPARK-18450 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yun Ni >Assignee: Apache Spark > > We are changing the LSH transform API from {{Vector}} to {{Array of Vector}}. > Once the change is applied, we can add AND-amplification to LSH > implementation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
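For readers unfamiliar with the term, AND-amplification means a candidate pair is kept only if *all* hash functions agree (cutting false positives), while OR-amplification keeps a pair if *any* agrees (cutting false negatives). A toy sketch, not Spark's ML API; the fixed sign-projection planes below are illustrative (a real LSH implementation draws them at random):

```python
# Sign-of-projection hashing with four fixed planes; AND vs OR amplification
# over the resulting signatures.

def sign_hash(plane, v):
    # 1-bit hash: which side of the hyperplane the vector falls on.
    return sum(p * x for p, x in zip(plane, v)) >= 0

planes = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0],
          [1.0, 1.0, 1.0]]

def signatures(v):
    return [sign_hash(p, v) for p in planes]

def and_match(a, b):
    # AND-amplification: every hash function must agree (fewer false positives).
    return all(x == y for x, y in zip(signatures(a), signatures(b)))

def or_match(a, b):
    # OR-amplification: any single agreement counts (fewer false negatives).
    return any(x == y for x, y in zip(signatures(a), signatures(b)))

close = ([1.0, 0.9, 1.1], [1.0, 1.0, 1.0])
far = ([1.0, 0.0, 0.0], [-1.0, 0.2, 0.1])
print(and_match(*close), and_match(*far))  # True False
```

Moving the transform output from a single {{Vector}} to an {{Array of Vector}} is what lets each element hold one hash function's output, so AND/OR combinations like the above become expressible.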
[jira] [Assigned] (SPARK-19586) Incorrect push down filter for double negative in SQL
[ https://issues.apache.org/jira/browse/SPARK-19586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-19586: - Assignee: Xiao Li > Incorrect push down filter for double negative in SQL > - > > Key: SPARK-19586 > URL: https://issues.apache.org/jira/browse/SPARK-19586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Everett Anderson >Assignee: Xiao Li > Fix For: 2.0.3 > > > Opening this as it's a somewhat serious issue in the 2.0.x tree in case > there's a 2.0.3 planned, but it is fixed in 2.1.0. > While it works in 1.6.2 and 2.1.0, it appears 2.0.2 has a significant filter > optimization error. > Example: > {noformat} > // Create some fake data > import org.apache.spark.sql.Row > import org.apache.spark.sql.Dataset > import org.apache.spark.sql.types._ > val rowsRDD = sc.parallelize(Seq( > Row(1, "fred"), > Row(2, "amy"), > Row(3, null))) > val schema = StructType(Seq( > StructField("id", IntegerType, nullable = true), > StructField("username", StringType, nullable = true))) > > val data = sqlContext.createDataFrame(rowsRDD, schema) > val path = "/tmp/test_data" > data.write.mode("overwrite").parquet(path) > val testData = sqlContext.read.parquet(path) > testData.registerTempTable("filter_test_table") > {noformat} > {noformat} > %sql > explain select count(*) from filter_test_table where not( username is not > null) > {noformat} > or > {noformat} > spark.sql("select count(*) from filter_test_table where not( username is not > null)").explain > {noformat} > In 2.0.2, I'm seeing > {noformat} > == Physical Plan == > *HashAggregate(keys=[], functions=[count(1)]) > +- Exchange SinglePartition > +- *HashAggregate(keys=[], functions=[partial_count(1)]) > +- *Project > +- *Filter (isnotnull(username#35) && NOT isnotnull(username#35)) > +- *BatchedScan parquet default.[username#35] Format: > ParquetFormat, InputPaths: , PartitionFilters: [], > PushedFilters: [IsNotNull(username), 
Not(IsNotNull(username))], ReadSchema: > struct > {noformat} > which seems like both an impossible Filter and an impossible pushed filter. > In Spark 1.6.2 it was > {noformat} > == Physical Plan == > TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], > output=[_c0#1822L]) > +- TungstenExchange SinglePartition, None > +- TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#1825L]) > +- Project > +- Filter NOT isnotnull(username#1590) > +- Scan ParquetRelation[username#1590] InputPaths: , > PushedFilters: [Not(IsNotNull(username))] > {noformat} > and 2.1.0 it's working again as > {noformat} > == Physical Plan == > *HashAggregate(keys=[], functions=[count(1)]) > +- Exchange SinglePartition >+- *HashAggregate(keys=[], functions=[partial_count(1)]) > +- *Project > +- *Filter NOT isnotnull(username#14) > +- *FileScan parquet [username#14] Batched: true, Format: > Parquet, Location: InMemoryFileIndex[file:/tmp/test_table], PartitionFilters: > [], PushedFilters: [Not(IsNotNull(username))], ReadSchema: > struct > {noformat} > while it's easy for humans in interactive cases to work around this by > removing the double negative, it's a bit harder if it's a programmatic > construct. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16043) Prepare GenericArrayData implementation specialized for a primitive array
[ https://issues.apache.org/jira/browse/SPARK-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-16043: - Assignee: Kazuaki Ishizaki > Prepare GenericArrayData implementation specialized for a primitive array > - > > Key: SPARK-16043 > URL: https://issues.apache.org/jira/browse/SPARK-16043 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki > Fix For: 2.2.0 > > > There is a ToDo of GenericArrayData class, which is to eliminate > boxing/unboxing for a primitive array (described > [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L31]) > It would be good to prepare GenericArrayData implementation specialized for a > primitive array to eliminate boxing/unboxing from the view of runtime memory > footprint and performance. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
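The JVM boxing overhead this ticket targets is not directly visible from Python, but the same memory argument can be demonstrated with the standard-library `array` module: a specialized primitive container stores raw 8-byte doubles, while a generic list holds one heap object per element. This is an analogy only, not Spark code.

```python
import sys
from array import array

n = 10_000
boxed = [float(i) for i in range(n)]  # list: one pointer + one float object per element
unboxed = array("d", range(n))        # contiguous unboxed 8-byte doubles

# The list's own size only counts pointers; each float object adds its own
# ~24-byte header on top, which is the "boxing" cost.
per_element_boxed = sys.getsizeof(boxed) / n + sys.getsizeof(1.0)
per_element_unboxed = sys.getsizeof(unboxed) / n

print(round(per_element_boxed), round(per_element_unboxed))
assert per_element_unboxed < per_element_boxed
```

A `GenericArrayData` specialized for primitives would get the JVM equivalent of the `array("d", ...)` layout: smaller footprint and no per-access box/unbox.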
[jira] [Created] (SPARK-19636) Feature parity for correlation statistics in MLlib
Timothy Hunter created SPARK-19636: -- Summary: Feature parity for correlation statistics in MLlib Key: SPARK-19636 URL: https://issues.apache.org/jira/browse/SPARK-19636 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.1.0 Reporter: Timothy Hunter This ticket tracks porting the functionality of spark.mllib.Statistics.corr() over to spark.ml. Here is a design doc: https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib
Timothy Hunter created SPARK-19635: -- Summary: Feature parity for Chi-square hypothesis testing in MLlib Key: SPARK-19635 URL: https://issues.apache.org/jira/browse/SPARK-19635 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.1.0 Reporter: Timothy Hunter This ticket tracks porting the functionality of spark.mllib.Statistics.chiSqTest over to spark.ml. Here is a design doc: https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19634) Feature parity for descriptive statistics in MLlib
Timothy Hunter created SPARK-19634: -- Summary: Feature parity for descriptive statistics in MLlib Key: SPARK-19634 URL: https://issues.apache.org/jira/browse/SPARK-19634 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.1.0 Reporter: Timothy Hunter This ticket tracks porting the functionality of spark.mllib.MultivariateOnlineSummarizer over to spark.ml. A design has been discussed in SPARK-19208 . Here is a design doc: https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19557) Output parameters are not present in SQL Query Plan
[ https://issues.apache.org/jira/browse/SPARK-19557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870664#comment-15870664 ] Apache Spark commented on SPARK-19557: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/16962 > Output parameters are not present in SQL Query Plan > --- > > Key: SPARK-19557 > URL: https://issues.apache.org/jira/browse/SPARK-19557 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Salil Surendran > > For DataFrameWriter methods like parquet(), json(), csv() etc. output > parameters are not present in the QueryExecution object. For methods like > saveAsTable() they do. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19557) Output parameters are not present in SQL Query Plan
[ https://issues.apache.org/jira/browse/SPARK-19557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19557: Assignee: Apache Spark > Output parameters are not present in SQL Query Plan > --- > > Key: SPARK-19557 > URL: https://issues.apache.org/jira/browse/SPARK-19557 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Salil Surendran >Assignee: Apache Spark > > For DataFrameWriter methods like parquet(), json(), csv() etc. output > parameters are not present in the QueryExecution object. For methods like > saveAsTable() they do. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19557) Output parameters are not present in SQL Query Plan
[ https://issues.apache.org/jira/browse/SPARK-19557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19557: Assignee: (was: Apache Spark) > Output parameters are not present in SQL Query Plan > --- > > Key: SPARK-19557 > URL: https://issues.apache.org/jira/browse/SPARK-19557 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Salil Surendran > > For DataFrameWriter methods like parquet(), json(), csv() etc. output > parameters are not present in the QueryExecution object. For methods like > saveAsTable() they do. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870655#comment-15870655 ] Timothy Hunter commented on SPARK-19208: I put together the ideas in this thread into a document. I will update the umbrella ticket with sub-tasks once folks have had a chance to comment: https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However, {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task; moreover, it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748,401 instances and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After the modification in the PR, the above example runs successfully. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
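The core idea of the optimization above, maintaining only the statistics a caller actually needs, can be sketched in plain Python. The function name is illustrative; the real change lives in Spark's Scala summarizer.

```python
# One pass, one array: enough for MaxAbsScaler-style scaling, instead of the
# 8 arrays a full multivariate summarizer allocates per feature vector.

def max_abs(rows):
    """Column-wise max of absolute values over an iterable of rows."""
    max_abs_vals = None
    for row in rows:
        if max_abs_vals is None:
            max_abs_vals = [abs(x) for x in row]
        else:
            for j, x in enumerate(row):
                if abs(x) > max_abs_vals[j]:
                    max_abs_vals[j] = abs(x)
    return max_abs_vals

rows = [[1.0, -5.0, 2.0],
        [-3.0, 4.0, 0.5]]
print(max_abs(rows))  # [3.0, 5.0, 2.0]
```

With 29,890,095 features, this is one `Double` array of that length instead of eight, which is exactly why the OOM in the example disappears.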
[jira] [Closed] (SPARK-19632) Allow configuring non-hive and non-local SessionState and ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-19632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kruszewski closed SPARK-19632. - Resolution: Won't Fix > Allow configuring non-hive and non-local SessionState and ExternalCatalog > - > > Key: SPARK-19632 > URL: https://issues.apache.org/jira/browse/SPARK-19632 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Robert Kruszewski > > I want to be able to integrate non hive catalogs into spark. It seems api is > pretty close already. This issue proposes opening up existing interfaces for > external implementation. > For ExternalCatalog there seems to be already an abstract class and > SessionState java doc suggests it should stay a scala class to avoid > initialization order -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19632) Allow configuring non-hive and non-local SessionState and ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-19632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870649#comment-15870649 ] Robert Kruszewski commented on SPARK-19632: --- Thanks, I must have missed that. > Allow configuring non-hive and non-local SessionState and ExternalCatalog > - > > Key: SPARK-19632 > URL: https://issues.apache.org/jira/browse/SPARK-19632 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Robert Kruszewski > > I want to be able to integrate non hive catalogs into spark. It seems api is > pretty close already. This issue proposes opening up existing interfaces for > external implementation. > For ExternalCatalog there seems to be already an abstract class and > SessionState java doc suggests it should stay a scala class to avoid > initialization order -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19534) Convert Java tests to use lambdas, Java 8 features
[ https://issues.apache.org/jira/browse/SPARK-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870638#comment-15870638 ] Apache Spark commented on SPARK-19534: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/16964 > Convert Java tests to use lambdas, Java 8 features > -- > > Key: SPARK-19534 > URL: https://issues.apache.org/jira/browse/SPARK-19534 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 2.2.0 >Reporter: Sean Owen >Assignee: Sean Owen > > Likewise, Java tests can be simplified by use of Java 8 lambdas. This is a > significant sub-task in its own right. This shouldn't mean that 'old' APIs go > untested because there are no separate Java 8 APIs; it's just syntactic sugar > for calls to the same APIs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19632) Allow configuring non-hive and non-local SessionState and ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-19632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870630#comment-15870630 ] Dongjoon Hyun commented on SPARK-19632: --- Hi, [~robert3005]. I just added a previous JIRA issue link for this issue. > Allow configuring non-hive and non-local SessionState and ExternalCatalog > - > > Key: SPARK-19632 > URL: https://issues.apache.org/jira/browse/SPARK-19632 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Robert Kruszewski > > I want to be able to integrate non hive catalogs into spark. It seems api is > pretty close already. This issue proposes opening up existing interfaces for > external implementation. > For ExternalCatalog there seems to be already an abstract class and > SessionState java doc suggests it should stay a scala class to avoid > initialization order -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19633) FileSource read from FileSink
Michael Armbrust created SPARK-19633: Summary: FileSource read from FileSink Key: SPARK-19633 URL: https://issues.apache.org/jira/browse/SPARK-19633 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Michael Armbrust Priority: Critical Right now, you can't start a streaming query from a location that is being written to by the file sink. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19632) Allow configuring non-hive and non-local SessionState and ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-19632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19632: Assignee: Apache Spark > Allow configuring non-hive and non-local SessionState and ExternalCatalog > - > > Key: SPARK-19632 > URL: https://issues.apache.org/jira/browse/SPARK-19632 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Robert Kruszewski >Assignee: Apache Spark > > I want to be able to integrate non hive catalogs into spark. It seems api is > pretty close already. This issue proposes opening up existing interfaces for > external implementation. > For ExternalCatalog there seems to be already an abstract class and > SessionState java doc suggests it should stay a scala class to avoid > initialization order -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19632) Allow configuring non-hive and non-local SessionState and ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-19632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19632: Assignee: (was: Apache Spark) > Allow configuring non-hive and non-local SessionState and ExternalCatalog > - > > Key: SPARK-19632 > URL: https://issues.apache.org/jira/browse/SPARK-19632 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Robert Kruszewski > > I want to be able to integrate non hive catalogs into spark. It seems api is > pretty close already. This issue proposes opening up existing interfaces for > external implementation. > For ExternalCatalog there seems to be already an abstract class and > SessionState java doc suggests it should stay a scala class to avoid > initialization order -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19632) Allow configuring non-hive and non-local SessionState and ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-19632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870617#comment-15870617 ] Apache Spark commented on SPARK-19632: -- User 'robert3005' has created a pull request for this issue: https://github.com/apache/spark/pull/16963 > Allow configuring non-hive and non-local SessionState and ExternalCatalog > - > > Key: SPARK-19632 > URL: https://issues.apache.org/jira/browse/SPARK-19632 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Robert Kruszewski > > I want to be able to integrate non hive catalogs into spark. It seems api is > pretty close already. This issue proposes opening up existing interfaces for > external implementation. > For ExternalCatalog there seems to be already an abstract class and > SessionState java doc suggests it should stay a scala class to avoid > initialization order -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17302) Cannot set non-Spark SQL session variables in hive-site.xml, spark-defaults.conf, or using --conf
[ https://issues.apache.org/jira/browse/SPARK-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870616#comment-15870616 ] Abhishek Madav commented on SPARK-17302: I believe this is fixed as part of SPARK-15887. Could you check? > Cannot set non-Spark SQL session variables in hive-site.xml, > spark-defaults.conf, or using --conf > - > > Key: SPARK-17302 > URL: https://issues.apache.org/jira/browse/SPARK-17302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ryan Blue > > When configuration changed for 2.0 to the new SparkSession structure, Spark > stopped using Hive's internal HiveConf for session state and now uses > HiveSessionState and an associated SQLConf. Now, session options like > hive.exec.compress.output and hive.exec.dynamic.partition.mode are pulled > from this SQLConf. This doesn't include session properties from hive-site.xml > (including hive.exec.compress.output), and no longer contains Spark-specific > overrides from spark-defaults.conf that used the spark.hadoop.hive... pattern. > Also, setting these variables on the command-line no longer works because > settings must start with "spark.". > Is there a recommended way to set Hive session properties? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
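The {{spark.hadoop.hive...}} pattern mentioned in the report is a prefix convention: Spark copies such entries into the Hadoop/Hive configuration with the prefix stripped. A minimal sketch of that convention (illustrative code, not Spark's implementation):

```python
# Entries under the "spark.hadoop." namespace are forwarded to the
# Hadoop/Hive configuration with the prefix removed.

def extract_hadoop_conf(spark_conf):
    prefix = "spark.hadoop."
    return {k[len(prefix):]: v
            for k, v in spark_conf.items()
            if k.startswith(prefix)}

conf = {
    "spark.master": "local[4]",
    "spark.hadoop.hive.exec.compress.output": "true",
    "spark.hadoop.hive.exec.dynamic.partition.mode": "nonstrict",
}
print(extract_hadoop_conf(conf))
```

This also shows why plain `--conf hive.exec.compress.output=true` is rejected: without the `spark.` prefix the key never enters Spark's configuration in the first place.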
[jira] [Updated] (SPARK-19628) Duplicate Spark jobs in 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-19628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19628: - Component/s: (was: Spark Core) SQL > Duplicate Spark jobs in 2.1.0 > - > > Key: SPARK-19628 > URL: https://issues.apache.org/jira/browse/SPARK-19628 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jork Zijlstra > Fix For: 2.0.1 > > Attachments: spark2.0.1.png, spark2.1.0-examplecode.png, > spark2.1.0.png > > > After upgrading to Spark 2.1.0, we noticed that duplicate jobs are executed. > Going back to Spark 2.0.1, they are gone again. > {code} > import org.apache.spark.sql._ > object DoubleJobs { > def main(args: Array[String]) { > System.setProperty("hadoop.home.dir", "/tmp"); > val sparkSession: SparkSession = SparkSession.builder > .master("local[4]") > .appName("spark session example") > .config("spark.driver.maxResultSize", "6G") > .config("spark.sql.orc.filterPushdown", true) > .config("spark.sql.hive.metastorePartitionPruning", true) > .getOrCreate() > sparkSession.sqlContext.setConf("spark.sql.orc.filterPushdown", "true") > val paths = Seq( > ""//some orc source > ) > def dataFrame(path: String): DataFrame = { > sparkSession.read.orc(path) > } > paths.foreach(path => { > dataFrame(path).show(20) > }) > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19632) Allow configuring non-hive and non-local SessionState and ExternalCatalog
Robert Kruszewski created SPARK-19632: - Summary: Allow configuring non-hive and non-local SessionState and ExternalCatalog Key: SPARK-19632 URL: https://issues.apache.org/jira/browse/SPARK-19632 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Robert Kruszewski I want to be able to integrate non-Hive catalogs into Spark, and the existing API already seems close to allowing it. This issue proposes opening up the existing interfaces for external implementation. For ExternalCatalog there is already an abstract class, and the SessionState Javadoc suggests it should stay a Scala class to avoid initialization-order issues. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18120) QueryExecutionListener method doesn't get executed for DataFrameWriter methods
[ https://issues.apache.org/jira/browse/SPARK-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870586#comment-15870586 ] Apache Spark commented on SPARK-18120: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/16962 > QueryExecutionListener method doesnt' get executed for DataFrameWriter methods > -- > > Key: SPARK-18120 > URL: https://issues.apache.org/jira/browse/SPARK-18120 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Salil Surendran > > QueryExecutionListener is a class that has methods named onSuccess() and > onFailure() that gets called when a query is executed. Each of those methods > takes a QueryExecution object as a parameter which can be used for metrics > analysis. It gets called for several of the DataSet methods like take, head, > first, collect etc. but doesn't get called for any of the DataFrameWriter > methods like saveAsTable, save etc. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
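The listener contract described above can be sketched in plain Python; the `write` method below models the reported gap, where a writer-style code path runs the work but never notifies listeners. All names are illustrative, not Spark's API.

```python
# Observer pattern with success/failure callbacks, and one code path that
# bypasses the notification step (the bug this ticket reports).

class Listener:
    def __init__(self):
        self.events = []

    def on_success(self, name, result):
        self.events.append(("success", name))

    def on_failure(self, name, exc):
        self.events.append(("failure", name))


class Engine:
    def __init__(self):
        self.listeners = []

    def run(self, name, action):
        # Analogous to Dataset actions (collect, take, ...): listeners fire.
        try:
            result = action()
        except Exception as exc:
            for listener in self.listeners:
                listener.on_failure(name, exc)
            raise
        for listener in self.listeners:
            listener.on_success(name, result)
        return result

    def write(self, name, action):
        # Analogous to the buggy writer path: runs the action but never
        # notifies listeners, so metrics for writes are silently lost.
        return action()


engine = Engine()
listener = Listener()
engine.listeners.append(listener)

engine.run("collect", lambda: [1, 2, 3])
engine.write("save", lambda: None)
print(listener.events)  # [('success', 'collect')]; the write left no trace
```

The fix is to route writer-style operations through the same notifying path as actions.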
[jira] [Commented] (SPARK-18891) Support for specific collection types
[ https://issues.apache.org/jira/browse/SPARK-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870570#comment-15870570 ] Michal Šenkýř commented on SPARK-18891: --- Started my work on Map support, as there is still no progress on the optimization PR for Seq. I chose the builder approach as it seemed more straightforward. I will keep the Map implementation separate from the Seq one so that it works even if the Seq PR doesn't get approved. > Support for specific collection types > - > > Key: SPARK-18891 > URL: https://issues.apache.org/jira/browse/SPARK-18891 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.3, 2.1.0 >Reporter: Michael Armbrust >Priority: Critical > > Encoders treat all collections the same (i.e. {{Seq}} vs {{List}}), which > forces users to only define classes with the most generic type. > An [example > error|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/2398463439880241/2840265927289860/latest.html]: > {code} > case class SpecificCollection(aList: List[Int]) > Seq(SpecificCollection(1 :: Nil)).toDS().collect() > {code} > {code} > java.lang.RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 98, Column 120: No applicable constructor/method found > for actual parameters "scala.collection.Seq"; candidates are: > "line29e7e4b1e36445baa3505b2e102aa86b29.$read$$iw$$iw$$iw$$iw$SpecificCollection(scala.collection.immutable.List)" > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19614) add type-preserving null function
[ https://issues.apache.org/jira/browse/SPARK-19614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870525#comment-15870525 ] Nick Dimiduk commented on SPARK-19614: -- {{lit(null).cast(type)}} does exactly what I needed. Thanks fellas. > add type-preserving null function > - > > Key: SPARK-19614 > URL: https://issues.apache.org/jira/browse/SPARK-19614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nick Dimiduk >Priority: Trivial > > There's currently no easy way to extend the columns of a DataFrame with null > columns that also preserves the type. {{lit(null)}} evaluates to > {{Literal(null, NullType)}}, despite any subsequent hinting, for instance > with {{Column.as(String, Metadata)}}. This comes up when programmatically > munging data from disparate sources. A function such as {{null(DataType)}} > would be nice. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-19614) add type-preserving null function
[ https://issues.apache.org/jira/browse/SPARK-19614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Dimiduk closed SPARK-19614. Resolution: Invalid > add type-preserving null function > - > > Key: SPARK-19614 > URL: https://issues.apache.org/jira/browse/SPARK-19614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nick Dimiduk >Priority: Trivial > > There's currently no easy way to extend the columns of a DataFrame with null > columns that also preserves the type. {{lit(null)}} evaluates to > {{Literal(null, NullType)}}, despite any subsequent hinting, for instance > with {{Column.as(String, Metadata)}}. This comes up when programmatically > munging data from disparate sources. A function such as {{null(DataType)}} > would be nice. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19617) Fix the race condition when starting and stopping a query quickly
[ https://issues.apache.org/jira/browse/SPARK-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19617:
- Description: The streaming thread in StreamExecution uses the following ways to check whether it should exit:
- Catch an InterruptedException.
- `StreamExecution.state` is TERMINATED.

When starting and stopping a query quickly, both checks may fail:
- [HADOOP-14084|https://issues.apache.org/jira/browse/HADOOP-14084] is hit and the InterruptedException is swallowed.
- StreamExecution.stop is called before `state` becomes `ACTIVE`. Then [runBatches|https://github.com/apache/spark/blob/dcc2d540a53f0bd04baead43fdee1c170ef2b9f3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L252] changes the state from `TERMINATED` back to `ACTIVE`.

If both cases happen, the query will hang forever.

was: Saw the following exception in some test log:
{code}
17/02/14 21:20:10.987 stream execution thread for this_query [id = 09fd5d6d-bea3-4891-88c7-0d0f1909188d, runId = a564cb52-bc3d-47f1-8baf-7e0e4fa79a5e] WARN Shell: Interrupted while joining on: Thread[Thread-48,5,main]
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1249)
	at java.lang.Thread.join(Thread.java:1323)
	at org.apache.hadoop.util.Shell.joinThread(Shell.java:626)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:577)
	at org.apache.hadoop.util.Shell.run(Shell.java:479)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:491)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:532)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:509)
	at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1066)
	at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:176)
	at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197)
	at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:730)
	at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:726)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:733)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.mkdirs(HDFSMetadataLog.scala:385)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.<init>(HDFSMetadataLog.scala:75)
	at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.<init>(CompactibleFileStreamLog.scala:46)
	at org.apache.spark.sql.execution.streaming.FileStreamSourceLog.<init>(FileStreamSourceLog.scala:36)
	at org.apache.spark.sql.execution.streaming.FileStreamSource.<init>(FileStreamSource.scala:59)
	at org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:246)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:145)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:141)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257)
	at org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan$lzycompute(StreamExecution.scala:141)
	at org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan(StreamExecution.scala:136)
	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:191)
{code}
This is the cause of some test timeout failures on Jenkins.
> Fix the race condition when starting and stopping a query quickly
> -
>
> Key: SPARK-19617
> URL: https://issues.apache.org/jira/browse/SPARK-19617
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
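The interleaving described in the updated description can be modeled deterministically. The sketch below is hypothetical Python, not Spark's Scala code; the names `Query`, `run_batches_start`, and `should_exit` are illustrative stand-ins for `StreamExecution`, `runBatches`, and the state check.

```python
# Minimal model of the SPARK-19617 race: stop() runs before the streaming
# thread marks the query ACTIVE, so the TERMINATED flag gets overwritten.
# (Hypothetical sketch; not Spark's actual implementation.)

INITIALIZING, ACTIVE, TERMINATED = "INITIALIZING", "ACTIVE", "TERMINATED"

class Query:
    def __init__(self):
        self.state = INITIALIZING

    def stop(self):
        # If the InterruptedException is also swallowed (HADOOP-14084),
        # this flag is the only remaining exit signal.
        self.state = TERMINATED

    def run_batches_start(self):
        # runBatches unconditionally sets ACTIVE, clobbering TERMINATED
        # when stop() won the race.
        self.state = ACTIVE

    def should_exit(self):
        return self.state == TERMINATED

q = Query()
q.stop()                # stop() is called before the streaming thread starts
q.run_batches_start()   # state goes TERMINATED -> ACTIVE: the bug
print(q.should_exit())  # False: neither exit check fires, so the query hangs
```

With this interleaving, `should_exit()` never returns `True` again, matching the "query will hang forever" symptom in the description.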
[jira] [Updated] (SPARK-19617) Fix the race condition when starting and stopping a query quickly
[ https://issues.apache.org/jira/browse/SPARK-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19617: - Summary: Fix the race condition when starting and stopping a query quickly (was: Fix a case that a query may not stop due to HADOOP-14084) > Fix the race condition when starting and stopping a query quickly > - > > Key: SPARK-19617 > URL: https://issues.apache.org/jira/browse/SPARK-19617 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Saw the following exception in some test log: > {code} > 17/02/14 21:20:10.987 stream execution thread for this_query [id = > 09fd5d6d-bea3-4891-88c7-0d0f1909188d, runId = > a564cb52-bc3d-47f1-8baf-7e0e4fa79a5e] WARN Shell: Interrupted while joining > on: Thread[Thread-48,5,main] > java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Thread.join(Thread.java:1249) > at java.lang.Thread.join(Thread.java:1323) > at org.apache.hadoop.util.Shell.joinThread(Shell.java:626) > at org.apache.hadoop.util.Shell.runCommand(Shell.java:577) > at org.apache.hadoop.util.Shell.run(Shell.java:479) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:866) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:849) > at > org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733) > at > org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:491) > at > org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:532) > at > org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:509) > at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1066) > at > org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:176) > at 
org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) > at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:730) > at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:726) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:733) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.mkdirs(HDFSMetadataLog.scala:385) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.(HDFSMetadataLog.scala:75) > at > org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.(CompactibleFileStreamLog.scala:46) > at > org.apache.spark.sql.execution.streaming.FileStreamSourceLog.(FileStreamSourceLog.scala:36) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.(FileStreamSource.scala:59) > at > org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:246) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:145) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:141) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257) > at > org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan$lzycompute(StreamExecution.scala:141) > at > org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan(StreamExecution.scala:136) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252) > at > 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:191) > {code} > This is the cause of some test timeout failures on Jenkins.
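One way to close a race of this shape is to make the INITIALIZING-to-ACTIVE transition conditional under a lock, so a `stop()` that already set TERMINATED can never be overwritten. This is a hypothetical sketch of the pattern, not Spark's actual patch for SPARK-19617; all names are illustrative.

```python
# Guarded state transition: become ACTIVE only if stop() has not already
# terminated the query. (Hypothetical sketch of the fix pattern.)
import threading

INITIALIZING, ACTIVE, TERMINATED = "INITIALIZING", "ACTIVE", "TERMINATED"

class Query:
    def __init__(self):
        self._lock = threading.Lock()
        self.state = INITIALIZING

    def stop(self):
        with self._lock:
            self.state = TERMINATED

    def run_batches_start(self):
        # Check-and-set under the lock instead of an unconditional write.
        with self._lock:
            if self.state == INITIALIZING:
                self.state = ACTIVE
                return True
            return False  # already TERMINATED: exit instead of looping

q = Query()
q.stop()                        # stop() wins the race again...
print(q.run_batches_start())    # False: ...but now the thread exits cleanly
```

Because both writes take the same lock and the transition is conditional, the TERMINATED flag survives regardless of which side runs first.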
[jira] [Commented] (SPARK-19337) Documentation and examples for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-19337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870514#comment-15870514 ] yuhao yang commented on SPARK-19337: Sure, I'll send a PR today. > Documentation and examples for LinearSVC > > > Key: SPARK-19337 > URL: https://issues.apache.org/jira/browse/SPARK-19337 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley > > User guide + example code for LinearSVC
[jira] [Commented] (SPARK-19615) Provide Dataset union convenience for divergent schema
[ https://issues.apache.org/jira/browse/SPARK-19615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870513#comment-15870513 ] Nick Dimiduk commented on SPARK-19615:
-- IMHO, a union operation should be as generous as possible. This facilitates common ETL and data cleansing operations where the sources are sparse-schema structures (JSON, HBase, Elasticsearch, ...). A couple of examples of what I mean. Given dataframes of type
{noformat}
root
 |-- a: string (nullable = false)
 |-- b: string (nullable = true)
{noformat}
and
{noformat}
root
 |-- a: string (nullable = false)
 |-- c: string (nullable = true)
{noformat}
I would expect the union operation to infer the nullable columns from both sides to produce a dataframe of type
{noformat}
root
 |-- a: string (nullable = false)
 |-- b: string (nullable = true)
 |-- c: string (nullable = true)
{noformat}
This should work on an arbitrarily deep nesting of structs, so
{noformat}
root
 |-- a: string (nullable = false)
 |-- b: struct (nullable = false)
 |    |-- b1: string (nullable = true)
 |    |-- b2: string (nullable = true)
{noformat}
unioned with
{noformat}
root
 |-- a: string (nullable = false)
 |-- b: struct (nullable = false)
 |    |-- b3: string (nullable = true)
 |    |-- b4: string (nullable = true)
{noformat}
would result in
{noformat}
root
 |-- a: string (nullable = false)
 |-- b: struct (nullable = false)
 |    |-- b1: string (nullable = true)
 |    |-- b2: string (nullable = true)
 |    |-- b3: string (nullable = true)
 |    |-- b4: string (nullable = true)
{noformat}
> Provide Dataset union convenience for divergent schema
> --
>
> Key: SPARK-19615
> URL: https://issues.apache.org/jira/browse/SPARK-19615
> Project: Spark
> Issue Type: Improvement
> Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>Priority: Minor
>
> Creating a union DataFrame over two sources that have different schema
> definitions is surprisingly complex. Provide a version of the union method
> that will infer a target schema as the result of merging the sources,
> automatically extending either side with {{null}} columns for any missing
> columns that are nullable.
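The proposed behavior — merge the two schemas, then pad each side with null columns — can be sketched outside Spark with plain Python dicts standing in for rows. This is a hypothetical illustration of the request, not a Spark API; `union_by_schema` and `merge_schemas` are made-up names, and nested-struct merging is omitted for brevity.

```python
# Sketch of "union with inferred merged schema": each side is extended
# with None (i.e. null) for columns only the other side defines.
# (Hypothetical; rows are dicts here, not Spark DataFrames.)

def merge_schemas(left_cols, right_cols):
    # Keep left-side column order, then append right-only columns.
    merged = list(left_cols)
    merged += [c for c in right_cols if c not in left_cols]
    return merged

def union_by_schema(left_rows, left_cols, right_rows, right_cols):
    cols = merge_schemas(left_cols, right_cols)
    pad = lambda row: {c: row.get(c) for c in cols}  # missing column -> None
    return cols, [pad(r) for r in left_rows + right_rows]

cols, rows = union_by_schema(
    [{"a": "x", "b": "y"}], ["a", "b"],   # schema: a, b
    [{"a": "z", "c": "w"}], ["a", "c"],   # schema: a, c
)
print(cols)  # ['a', 'b', 'c']
print(rows)  # [{'a': 'x', 'b': 'y', 'c': None}, {'a': 'z', 'b': None, 'c': 'w'}]
```

For the nested-struct case in the comment above, the same merge would be applied recursively to struct fields. (Later Spark versions added `unionByName` with an `allowMissingColumns` flag covering the flat case.)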
[jira] [Updated] (SPARK-19615) Provide Dataset union convenience for divergent schema
[ https://issues.apache.org/jira/browse/SPARK-19615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Dimiduk updated SPARK-19615: - Priority: Major (was: Minor)
> Provide Dataset union convenience for divergent schema
> --
>
> Key: SPARK-19615
> URL: https://issues.apache.org/jira/browse/SPARK-19615
> Project: Spark
> Issue Type: Improvement
> Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>
> Creating a union DataFrame over two sources that have different schema
> definitions is surprisingly complex. Provide a version of the union method
> that will infer a target schema as the result of merging the sources,
> automatically extending either side with {{null}} columns for any missing
> columns that are nullable.