[jira] [Assigned] (SPARK-19826) spark.ml Python API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19826:
------------------------------------
    Assignee: (was: Apache Spark)

> spark.ml Python API for PIC
>
> Key: SPARK-19826
> URL: https://issues.apache.org/jira/browse/SPARK-19826
> Project: Spark
> Issue Type: Sub-task
> Components: ML, PySpark
> Affects Versions: 2.1.0
> Reporter: Felix Cheung
> Priority: Major

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19826) spark.ml Python API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19826:
------------------------------------
    Assignee: Apache Spark

> spark.ml Python API for PIC
[jira] [Commented] (SPARK-19826) spark.ml Python API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446626#comment-16446626 ]

Apache Spark commented on SPARK-19826:
--------------------------------------
User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/21119

> spark.ml Python API for PIC
[jira] [Updated] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is
[ https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-17916:
----------------------------
    Target Version/s: 2.4.0

> CSV data source treats empty string as null no matter what nullValue option is
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1
> Reporter: Hossein Falaki
> Priority: Major
>
> When a user configures {{nullValue}} in the CSV data source, in addition to those values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.
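The reported behavior can be mimicked outside Spark. The sketch below is a plain-Python illustration (stdlib `csv` module, not Spark's reader) of the difference between the current behavior (empty strings also become null) and the behavior the reporter expects; the `parse_csv` helper and its `empty_as_null` flag are invented for illustration:

```python
import csv
import io

def parse_csv(text, null_value, empty_as_null):
    """Parse CSV text, mapping null_value (and optionally "") to None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        cleaned = {}
        for key, value in row.items():
            if value == null_value or (empty_as_null and value == ""):
                cleaned[key] = None
            else:
                cleaned[key] = value
        rows.append(cleaned)
    return rows

data = 'col1,col2\n1,"-"\n2,""\n'

# Behavior described in the ticket: both "-" and "" become null.
spark_like = parse_csv(data, null_value="-", empty_as_null=True)
# Behavior the reporter expects: only "-" becomes null; "" survives.
expected = parse_csv(data, null_value="-", empty_as_null=False)
```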
[jira] [Assigned] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.
[ https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23325:
------------------------------------
    Assignee: Apache Spark

> DataSourceV2 readers should always produce InternalRow.
>
> Key: SPARK-23325
> URL: https://issues.apache.org/jira/browse/SPARK-23325
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Ryan Blue
> Assignee: Apache Spark
> Priority: Major
>
> DataSourceV2 row-oriented implementations are limited to producing either {{Row}} instances or {{UnsafeRow}} instances by implementing {{SupportsScanUnsafeRow}}. Instead, I think that implementations should always produce {{InternalRow}}.
>
> The problem with the choice between {{Row}} and {{UnsafeRow}} is that neither one is appropriate for implementers.
>
> File formats don't produce {{Row}} instances or the data values used by {{Row}}, like {{java.sql.Timestamp}} and {{java.sql.Date}}. An implementation that uses {{Row}} instances must produce data that is immediately translated from the representation that was just produced by Spark. In my experience, it made little sense to translate a timestamp in microseconds to a (milliseconds, nanoseconds) pair, create a {{Timestamp}} instance, and pass that instance to Spark for immediate translation back.
>
> On the other hand, {{UnsafeRow}} is very difficult to produce unless data is already held in memory. Even the Parquet support built into Spark deserializes to {{InternalRow}} and then uses {{UnsafeProjection}} to produce unsafe rows. When I went to build an implementation that deserializes Parquet or Avro directly to {{UnsafeRow}} (I tried both), I found that it couldn't be done without first deserializing into memory, because the size of an array must be known before any values are written.
>
> I ended up deciding to deserialize to {{InternalRow}} and use {{UnsafeProjection}} to convert to unsafe. There are two problems with this: first, this is Scala and was difficult to call from Java (it required reflection), and second, this causes double projection in the physical plan (a copy from unsafe to unsafe) if there is a projection that wasn't fully pushed to the data source.
>
> I think the solution is to have a single interface for readers that expects {{InternalRow}}. Then, a projection should be added in the Spark plan to convert to unsafe, avoiding projection both in the plan and in the data source. If the data source already produces unsafe rows by deserializing directly, this still minimizes the number of copies, because the unsafe projection will check whether the incoming data is already {{UnsafeRow}}.
>
> Using {{InternalRow}} would also match the interface on the write side.
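The copy-avoidance argument in the last paragraphs can be sketched abstractly. The following is a hypothetical, non-Spark Python sketch of a projection that converts only rows that are not already in the target representation; the class and function names are invented for illustration and are not Spark's API:

```python
class UnsafeRow:
    """Stand-in for a binary row format (hypothetical)."""
    def __init__(self, values):
        self.values = list(values)

def unsafe_projection(row):
    """Convert a row to UnsafeRow, skipping the copy when it already is one."""
    # If the incoming row is already UnsafeRow, return it as-is: no copy.
    if isinstance(row, UnsafeRow):
        return row
    # Otherwise convert; this is the single copy the plan pays for.
    return UnsafeRow(row)

internal = [1, "a"]                  # a generic InternalRow-like row
already_unsafe = UnsafeRow([2, "b"])

projected = unsafe_projection(already_unsafe)  # same object back, zero copies
converted = unsafe_projection(internal)        # one conversion copy
```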
[jira] [Commented] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.
[ https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446313#comment-16446313 ]

Apache Spark commented on SPARK-23325:
--------------------------------------
User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/21118

> DataSourceV2 readers should always produce InternalRow.
[jira] [Assigned] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.
[ https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23325:
------------------------------------
    Assignee: (was: Apache Spark)

> DataSourceV2 readers should always produce InternalRow.
[jira] [Updated] (SPARK-24019) AnalysisException for Window function expression to compute derivative
[ https://issues.apache.org/jira/browse/SPARK-24019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Jin updated SPARK-24019:
---------------------------
    Component/s: (was: Spark Core)
                 SQL

> AnalysisException for Window function expression to compute derivative
>
> Key: SPARK-24019
> URL: https://issues.apache.org/jira/browse/SPARK-24019
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.1
> Environment: Ubuntu, spark 2.1.1, standalone.
> Reporter: Barry Becker
> Priority: Minor
>
> I am using spark 2.1.1 currently. I created an expression to compute the derivative of some series data using a window function. I have a simple reproducible case of the error. I'm only filing this bug because the error message says "Please file a bug report with this error message, stack trace, and the query." Here they are:
> {code:java}
> ((coalesce(lead(value#9, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), value#9) - coalesce(lag(value#9, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), value#9)) / cast((coalesce(lead(sequence_num#7, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), sequence_num#7) - coalesce(lag(sequence_num#7, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), sequence_num#7)) as double)) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS derivative#14 has multiple Window Specifications (ArrayBuffer(windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING))).
> Please file a bug report with this error message, stack trace, and the query.;
> org.apache.spark.sql.AnalysisException: ((coalesce(lead(value#9, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), value#9) - coalesce(lag(value#9, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), value#9)) / cast((coalesce(lead(sequence_num#7, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), sequence_num#7) - coalesce(lag(sequence_num#7, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), sequence_num#7)) as double)) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS derivative#14 has multiple Window Specifications (ArrayBuffer(windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING))).
> Please file a bug report with this error message, stack trace, and the query.;
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$78.apply(Analyzer.scala:1772)
> {code}
> And here is a simple unit test that can be used to reproduce the problem:
> {code:java}
> import com.mineset.spark.testsupport.SparkTestCase.SPARK_SESSION
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions._
> import org.scalatest.FunSuite
> import com.mineset.spark.testsupport.SparkTestCase._
>
> /**
>  * Test to see that window functions work as expected on spark.
>  * @author Barry Becker
>  */
> class WindowFunctionSuite extends FunSuite {
>
>   val simpleDf = createSimpleData()
>
>   test("Window function for finding derivatives for 2 series") {
>     val window = Window.partitionBy("category").orderBy("sequence_num") //.rangeBetween(-1, 1)
>     // Consider the three sequential series points (Xlag, Ylag), (X, Y), (Xlead, Ylead).
>     // This defines the derivative as (Ylead - Ylag) / (Xlead - Xlag)
>     // If the lead or lag points are null, then we fall back on using the
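Independent of the window-function machinery, the derivative defined in the test's comments can be sketched in plain Python. The helper below is hypothetical, not part of the reported code; it implements the central difference (Ylead - Ylag) / (Xlead - Xlag), falling back to the current point where a lead or lag neighbor is missing:

```python
def derivative(points):
    """Central-difference derivative over (x, y) points.

    For each point, computes (y_lead - y_lag) / (x_lead - x_lag); at the
    edges of the series, where the lead or lag point is missing, falls
    back to using the current point in its place.
    """
    result = []
    for i, (x, y) in enumerate(points):
        x_lag, y_lag = points[i - 1] if i > 0 else (x, y)
        x_lead, y_lead = points[i + 1] if i + 1 < len(points) else (x, y)
        result.append((y_lead - y_lag) / (x_lead - x_lag))
    return result
```

For the line y = 2x sampled at x = 1, 2, 3 the derivative is 2 everywhere, including at the edges thanks to the fallback.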
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446247#comment-16446247 ]

Edwina Lu commented on SPARK-23206:
-----------------------------------
The design discussion is scheduled for Monday 4/23 at 11am PDT (Monday April 23, 6pm UTC). Bluejeans: https://linkedin.bluejeans.com/1886759322/

> Additional Memory Tuning Metrics
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
> Issue Type: Umbrella
> Components: Spark Core
> Affects Versions: 2.2.1
> Reporter: Edwina Lu
> Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png
>
> At LinkedIn, we have multiple clusters, running thousands of Spark applications, and these numbers are growing rapidly. We need to ensure that these Spark applications are well tuned -- cluster resources, including memory, should be used efficiently so that the cluster can support running more applications concurrently, and applications should run quickly and reliably.
>
> Currently there is limited visibility into how much memory executors are using, and users are guessing numbers for executor and driver memory sizing. These estimates are often much larger than needed, leading to memory wastage. Examining the metrics for one cluster for a month, the average percentage of used executor memory (max JVM used memory across executors / spark.executor.memory) is 35%, leading to an average of 591GB unused memory per application (number of executors * (spark.executor.memory - max JVM used memory)). Spark has multiple memory regions (user memory, execution memory, storage memory, and overhead memory), and to understand how memory is being used and fine-tune allocation between regions, it would be useful to have information about how much memory is being used for the different regions.
>
> To improve visibility into memory usage for the driver and executors and different memory regions, the following additional memory metrics can be tracked for each executor and driver:
> * JVM used memory: the JVM heap size for the executor/driver.
> * Execution memory: memory used for computation in shuffles, joins, sorts and aggregations.
> * Storage memory: memory used for caching and propagating internal data across the cluster.
> * Unified memory: sum of execution and storage memory.
>
> The peak values for each memory metric can be tracked for each executor, and also per stage. This information can be shown in the Spark UI and the REST APIs. Information for peak JVM used memory can help with determining appropriate values for spark.executor.memory and spark.driver.memory, and information about the unified memory region can help with determining appropriate values for spark.memory.fraction and spark.memory.storageFraction. Stage memory information can help identify which stages are most memory intensive, and users can look into the relevant code to determine if it can be optimized.
>
> The memory metrics can be gathered by adding the current JVM used memory, execution memory and storage memory to the heartbeat. SparkListeners are modified to collect the new metrics for the executors, stages and Spark history log. Only interesting values (peak values per stage per executor) are recorded in the Spark history log, to minimize the amount of additional logging.
>
> We have attached our design documentation with this ticket and would like to receive feedback from the community for this proposal.
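The unused-memory estimate quoted in the description can be reproduced arithmetically. The sketch below encodes the ticket's formula (executors * (spark.executor.memory - max JVM used memory)); the executor count and memory size here are hypothetical values chosen so the result lands near the reported ~591 GB average at 35% utilization:

```python
def unused_memory_gb(num_executors, executor_memory_gb, used_fraction):
    """Unused memory per the ticket's formula:
    executors * (spark.executor.memory - max JVM used memory)."""
    max_jvm_used = executor_memory_gb * used_fraction
    return num_executors * (executor_memory_gb - max_jvm_used)

# Hypothetical application: 100 executors of 9.1 GB each, at the
# ticket's reported average 35% utilization.
wasted = unused_memory_gb(100, 9.1, 0.35)  # ~591.5 GB left idle
```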
[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)
[ https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446176#comment-16446176 ]

Apache Spark commented on SPARK-10399:
--------------------------------------
User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/21117

> Off Heap Memory Access for non-JVM libraries (C++)
>
> Key: SPARK-10399
> URL: https://issues.apache.org/jira/browse/SPARK-10399
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Paul Weiss
> Priority: Major
>
> *Summary*
> Provide direct off-heap memory access to an external non-JVM program, such as a C++ library, within the running Spark JVM/executor. As Spark moves toward storing all data in off-heap memory, it makes sense to provide access points to that memory for non-JVM programs.
>
> *Assumptions*
> * Zero copies will be made during the call into the non-JVM library
> * Access into non-JVM libraries will be accomplished via JNI
> * A generic JNI interface will be created so that developers will not need to deal with the raw JNI call
> * C++ will be the initial target non-JVM use case
> * Memory management will remain on the JVM/Spark side
> * The API from C++ will be similar to DataFrames as much as feasible and NOT require expert knowledge of JNI
> * Data organization and layout will support complex (multi-type, nested, etc.) types
>
> *Design*
> * Initially Spark JVM -> non-JVM will be supported
> * Creating an embedded JVM with Spark running from a non-JVM program is initially out of scope
>
> *Technical*
> * GetDirectBufferAddress is the JNI call used to access a byte buffer without a copy
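The proposal relies on JNI's GetDirectBufferAddress to hand a native library the address of a direct buffer. As a language-neutral illustration of the underlying zero-copy idea (not the actual JNI mechanism), Python's buffer protocol shows two views sharing one memory region without copying:

```python
# Illustrative only: a memoryview exposes the same underlying bytes as the
# bytearray it wraps, without copying them -- analogous to how
# GetDirectBufferAddress exposes a direct ByteBuffer's memory to native code.

buf = bytearray(b"spark off-heap demo")
view = memoryview(buf)   # no copy: shares buf's memory

# Mutating through the view is visible in the original buffer,
# demonstrating that no copy was made.
view[0:5] = b"SPARK"
```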
[jira] [Commented] (SPARK-23879) Introduce MemoryBlock API instead of Platform API with Object
[ https://issues.apache.org/jira/browse/SPARK-23879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446177#comment-16446177 ]

Apache Spark commented on SPARK-23879:
--------------------------------------
User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/21117

> Introduce MemoryBlock API instead of Platform API with Object
>
> Key: SPARK-23879
> URL: https://issues.apache.org/jira/browse/SPARK-23879
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, SQL
> Affects Versions: 2.3.0
> Reporter: Kazuaki Ishizaki
> Priority: Major
>
> This JIRA is derived from SPARK-10399. During the discussion, the community revealed that the current Spark framework directly accesses several types of memory regions (e.g. {{byte[]}}, {{long[]}}, or off-heap) by using the {{Platform}} API with the {{Object}} type. It would be good to have a unified memory-management API with a clear memory model.
> It is also good for performance. If we can pass a typed object (e.g. {{byte[]}}, {{long[]}}) to {{Platform.getXX()/putXX()}}, it is faster than using {{Platform.getXX(Object)/putXX(Object, ...)}}.
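As a rough, non-JVM sketch of what a typed memory-block API buys over an untyped (Object, offset) interface, the following hypothetical Python class wraps one backing region behind typed accessors; the names are invented for illustration and do not reflect Spark's actual MemoryBlock API:

```python
import struct

class ByteArrayMemoryBlock:
    """Hypothetical on-heap memory region backed by a bytearray.

    Callers use typed get/put methods on the block itself, instead of
    passing an untyped object plus an offset to a global Platform-style
    helper -- the shape of the API change SPARK-23879 proposes.
    """
    def __init__(self, size):
        self._data = bytearray(size)

    def put_long(self, offset, value):
        # Little-endian 8-byte signed integer at the given offset.
        struct.pack_into("<q", self._data, offset, value)

    def get_long(self, offset):
        return struct.unpack_from("<q", self._data, offset)[0]

block = ByteArrayMemoryBlock(16)
block.put_long(8, 42)
```

An off-heap variant could present the same accessors over a different backing region, which is the point of unifying the API.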
[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)
[ https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446166#comment-16446166 ]

Kazuaki Ishizaki commented on SPARK-10399:
------------------------------------------
https://issues.apache.org/jira/browse/SPARK-23879 is the follow-up JIRA entry.

> Off Heap Memory Access for non-JVM libraries (C++)
[jira] [Commented] (SPARK-24038) refactor continuous write exec to its own class
[ https://issues.apache.org/jira/browse/SPARK-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446161#comment-16446161 ]

Apache Spark commented on SPARK-24038:
--------------------------------------
User 'jose-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/21116

> refactor continuous write exec to its own class
>
> Key: SPARK-24038
> URL: https://issues.apache.org/jira/browse/SPARK-24038
> Project: Spark
> Issue Type: Sub-task
> Components: Structured Streaming
> Affects Versions: 2.4.0
> Reporter: Jose Torres
> Priority: Major
[jira] [Assigned] (SPARK-24038) refactor continuous write exec to its own class
[ https://issues.apache.org/jira/browse/SPARK-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24038:
------------------------------------
    Assignee: (was: Apache Spark)

> refactor continuous write exec to its own class
[jira] [Assigned] (SPARK-24038) refactor continuous write exec to its own class
[ https://issues.apache.org/jira/browse/SPARK-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24038:
------------------------------------
    Assignee: Apache Spark

> refactor continuous write exec to its own class
[jira] [Created] (SPARK-24041) add flag to remove whitelist of continuous processing operators
Jose Torres created SPARK-24041:
-----------------------------------

             Summary: add flag to remove whitelist of continuous processing operators
                 Key: SPARK-24041
                 URL: https://issues.apache.org/jira/browse/SPARK-24041
             Project: Spark
          Issue Type: Sub-task
          Components: Structured Streaming
    Affects Versions: 2.4.0
            Reporter: Jose Torres

Initially this will just be for unit testing while support is being developed, but in the long term continuous processing should probably support most query nodes.
[jira] [Created] (SPARK-24039) remove restarting iterators hack
Jose Torres created SPARK-24039:
-----------------------------------

             Summary: remove restarting iterators hack
                 Key: SPARK-24039
                 URL: https://issues.apache.org/jira/browse/SPARK-24039
             Project: Spark
          Issue Type: Sub-task
          Components: Structured Streaming
    Affects Versions: 2.4.0
            Reporter: Jose Torres

Currently, continuous processing execution calls next() to restart the query iterator after it returns false. This doesn't work for complex RDDs - we need to call compute() instead. This isn't refactoring-only; changes will be required to keep the reader from starting over in each compute() call.
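The restart problem is a general iterator property, not something Spark-specific. A minimal Python sketch (with make_iterator standing in, loosely, for an RDD's compute()) shows why an exhausted iterator can't be "nexted" back to life, and why restarting means rebuilding it:

```python
def make_iterator():
    # Stand-in for compute(): builds a fresh iterator over the data.
    return iter([1, 2, 3])

it = make_iterator()
first_pass = list(it)    # exhausts the iterator

# Once exhausted, the same iterator yields nothing more; calling next()
# on it again cannot restart it.
leftover = list(it)      # empty

# Restarting requires recomputing, i.e. building a new iterator.
second_pass = list(make_iterator())
```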
[jira] [Created] (SPARK-24040) support single partition aggregates
Jose Torres created SPARK-24040:
-----------------------------------

             Summary: support single partition aggregates
                 Key: SPARK-24040
                 URL: https://issues.apache.org/jira/browse/SPARK-24040
             Project: Spark
          Issue Type: Sub-task
          Components: Structured Streaming
    Affects Versions: 2.4.0
            Reporter: Jose Torres

Single partition aggregates are a useful milestone because they don't involve a shuffle.
[jira] [Created] (SPARK-24038) refactor continuous write exec to its own class
Jose Torres created SPARK-24038:
-----------------------------------

             Summary: refactor continuous write exec to its own class
                 Key: SPARK-24038
                 URL: https://issues.apache.org/jira/browse/SPARK-24038
             Project: Spark
          Issue Type: Sub-task
          Components: Structured Streaming
    Affects Versions: 2.4.0
            Reporter: Jose Torres
[jira] [Created] (SPARK-24037) stateful operators
Jose Torres created SPARK-24037:
-----------------------------------

             Summary: stateful operators
                 Key: SPARK-24037
                 URL: https://issues.apache.org/jira/browse/SPARK-24037
             Project: Spark
          Issue Type: Sub-task
          Components: Structured Streaming
    Affects Versions: 2.4.0
            Reporter: Jose Torres

pointer to https://issues.apache.org/jira/browse/SPARK-24036
[jira] [Created] (SPARK-24036) Stateful operators in continuous processing
Jose Torres created SPARK-24036:
-----------------------------------

             Summary: Stateful operators in continuous processing
                 Key: SPARK-24036
                 URL: https://issues.apache.org/jira/browse/SPARK-24036
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 2.4.0
            Reporter: Jose Torres

The first iteration of continuous processing in Spark 2.3 does not work with stateful operators.
[jira] [Assigned] (SPARK-24033) LAG Window function broken in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-24033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24033: Assignee: Xiao Li (was: Apache Spark) > LAG Window function broken in Spark 2.3 > --- > > Key: SPARK-24033 > URL: https://issues.apache.org/jira/browse/SPARK-24033 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Emlyn Corrin >Assignee: Xiao Li >Priority: Major > > The {{LAG}} window function appears to be broken in Spark 2.3.0, always > failing with an AnalysisException. Interestingly, {{LEAD}} is not affected, > so it can be worked around by negating the lag and using lead instead. > Reproduction (run in {{spark-shell}}): > {code:java} > import org.apache.spark.sql.expressions.Window > val ds = Seq((1,1),(1,2),(1,3),(2,1),(2,2)).toDF("n", "i") > // The following works: > ds.withColumn("m", lead("i", > -1).over(Window.partitionBy("n").orderBy("i").rowsBetween(-1, -1))).show > // The following (equivalent) fails: > ds.withColumn("m", lag("i", > 1).over(Window.partitionBy("n").orderBy("i").rowsBetween(-1, -1))).show > {code} > Here is the stacktrace: > {quote} > org.apache.spark.sql.AnalysisException: Window Frame > specifiedwindowframe(RowFrame, -1, -1) must match the required frame > specifiedwindowframe(RowFrame, -1, -1); > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:2034) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:2030) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:85) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:85) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:120) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:125) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:125) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:85) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:76) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31.applyOrElse(Analyzer.scala:2030) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31.applyOrElse(Analyzer.scala:2029) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at >
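[Editorial note: the workaround in the report relies on the identity lag(col, n) = lead(col, -n) over an ordered partition. The following plain-Python sketch (not Spark's implementation) demonstrates that identity; the `lag`/`lead` helpers here are illustrative, not the Spark functions.]

```python
# Over an ordered partition, lag(values, n) and lead(values, -n)
# select the same neighboring row for every position.
def lag(values, offset, default=None):
    return [values[i - offset] if 0 <= i - offset < len(values) else default
            for i in range(len(values))]

def lead(values, offset, default=None):
    return [values[i + offset] if 0 <= i + offset < len(values) else default
            for i in range(len(values))]

partition = [1, 2, 3]  # column "i" within partition n = 1, ordered by "i"
assert lag(partition, 1) == lead(partition, -1) == [None, 1, 2]
```

This is why replacing the failing `lag("i", 1)` with `lead("i", -1)` in the reproduction produces the intended result.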
[jira] [Commented] (SPARK-24033) LAG Window function broken in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-24033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446090#comment-16446090 ] Apache Spark commented on SPARK-24033: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/21115
[jira] [Assigned] (SPARK-24033) LAG Window function broken in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-24033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24033: Assignee: Apache Spark (was: Xiao Li)
[jira] [Assigned] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22371: Assignee: (was: Apache Spark) > dag-scheduler-event-loop thread stopped with error Attempted to access > garbage collected accumulator 5605982 > - > > Key: SPARK-22371 > URL: https://issues.apache.org/jira/browse/SPARK-22371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Mayank Agarwal >Priority: Major > Attachments: Helper.scala, ShuffleIssue.java, > driver-thread-dump-spark2.1.txt, sampledata > > > Our Spark jobs are getting stuck in DagScheduler.runJob because the dag-scheduler > thread is stopped with *Attempted to access garbage collected > accumulator 5605982*. > From our investigation it looks like the accumulator is cleaned up by GC first, and > the same accumulator is then used for merging the results from an executor on a task > completion event. > Because java.lang.IllegalAccessError is a LinkageError, which is treated as > a fatal error, the dag-scheduler loop finishes with the exception below. 
> ---ERROR stack trace -- > Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 5605982 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > I am attaching the thread dump of driver as well -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446084#comment-16446084 ] Apache Spark commented on SPARK-22371: -- User 'artemrd' has created a pull request for this issue: https://github.com/apache/spark/pull/21114 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22371: Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-23775) Flaky test: DataFrameRangeSuite
[ https://issues.apache.org/jira/browse/SPARK-23775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reopened SPARK-23775: Assignee: (was: Gabor Somogyi) > Flaky test: DataFrameRangeSuite > --- > > Key: SPARK-23775 > URL: https://issues.apache.org/jira/browse/SPARK-23775 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Gabor Somogyi >Priority: Major > Fix For: 2.3.1, 2.4.0 > > Attachments: filtered.log, filtered_more_logs.log > > > DataFrameRangeSuite.test("Cancelling stage in a query with Range.") sometimes > stays in an infinite loop and times out the build. > I presume the original intention of this test is to start a job with range > and just cancel it. > The submitted job has 2 stages, but I think the author tried to cancel the > first stage, whose ID is 0, which this check does not match: > {code:java} > eventually(timeout(10.seconds), interval(1.millis)) { > assert(DataFrameRangeSuite.stageToKill > 0) > } > {code} > All in all, if the first stage is slower than 10 seconds, this throws > TestFailedDueToTimeoutException and cancelStage will never be called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
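[Editorial note: the failure mode described above can be sketched in plain Python. The `eventually` helper below is a hypothetical stand-in for ScalaTest's eventually(timeout, interval), not the real library.]

```python
import time

# Poll a condition, raising if it never holds within the timeout --
# the same shape as the test's eventually(timeout(10.seconds), ...).
def eventually(condition, timeout_s=0.1, interval_s=0.005):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(interval_s)
    raise TimeoutError("condition not met within timeout")

# The race from the report: the watched stage ID stays 0, but the
# condition checks `> 0`, so the poll times out and the cancelStage
# call that would follow is never reached.
stage_to_kill = 0
cancelled = False
try:
    eventually(lambda: stage_to_kill > 0)
    cancelled = True  # the test would call cancelStage here
except TimeoutError:
    pass
assert not cancelled
```

If the condition can never become true for the stage actually submitted, the timeout path is the only possible outcome, which matches the observed flakiness.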
[jira] [Commented] (SPARK-24035) SQL syntax for Pivot
[ https://issues.apache.org/jira/browse/SPARK-24035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446058#comment-16446058 ] Xiao Li commented on SPARK-24035: - Also cc [~simeons] > SQL syntax for Pivot > > > Key: SPARK-24035 > URL: https://issues.apache.org/jira/browse/SPARK-24035 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Maryann Xue >Priority: Major > > Some users are SQL experts but don’t know an ounce of Scala/Python or R. > Thus, we prefer to support the SQL syntax for Pivot too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23775) Flaky test: DataFrameRangeSuite
[ https://issues.apache.org/jira/browse/SPARK-23775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23775: Assignee: (was: Apache Spark) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23775) Flaky test: DataFrameRangeSuite
[ https://issues.apache.org/jira/browse/SPARK-23775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23775: Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23775) Flaky test: DataFrameRangeSuite
[ https://issues.apache.org/jira/browse/SPARK-23775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-23775: --- Fix Version/s: (was: 2.3.1) (was: 2.4.0) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24033) LAG Window function broken in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-24033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-24033: --- Assignee: Xiao Li
[jira] [Created] (SPARK-24035) SQL syntax for Pivot
Xiao Li created SPARK-24035: --- Summary: SQL syntax for Pivot Key: SPARK-24035 URL: https://issues.apache.org/jira/browse/SPARK-24035 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Xiao Li Assignee: Maryann Xue Some users are SQL experts but don’t know an ounce of Scala, Python, or R. Thus, we prefer to support the SQL syntax for Pivot too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445999#comment-16445999 ] Artem Rudoy edited comment on SPARK-22371 at 4/20/18 4:37 PM: -- Do we really need to throw an exception from AccumulatorContext.get() when an accumulator is garbage collected? There's a period of time when an accumulator has been garbage collected, but hasn't been removed from AccumulatorContext.originals by ContextCleaner. When an update is received for such accumulator it will throw an exception and kill the whole job. This can happen when a stage completed, but there're still running tasks from other attempts, speculation etc. Since AccumulatorContext.get() returns an Option we could just return None in such case. Before SPARK-20940 this method threw IllegalAccessError which is not a NonFatal, was caught at a lower level and didn't cause job failure. was (Author: rudoy): Do we really need to throw an exception from AccumulatorContext.get() when an accumulator is garbage collected? There's a period of time when an accumulator has been garbage collected, but hasn't been removed from AccumulatorContext.originals by ContextCleaner. When an update is received for such accumulator it will throw an exception and kill the whole job. This can happen when a stage completed, but there're still running tasks from other attempts, speculation etc. Since AccumulatorContext.get() returns an Option we could just return None in such case. Before SPARK-20940 this method thrown IllegalAccessError which is not a NonFatal, was caught at a lower level and didn't cause job failure. 
> dag-scheduler-event-loop thread stopped with error Attempted to access > garbage collected accumulator 5605982 > - > > Key: SPARK-22371 > URL: https://issues.apache.org/jira/browse/SPARK-22371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Mayank Agarwal >Priority: Major > Attachments: Helper.scala, ShuffleIssue.java, > driver-thread-dump-spark2.1.txt, sampledata > > > Our Spark jobs are getting stuck on DagScheduler.runJob as the dagscheduler > thread is stopped because of *Attempted to access garbage collected > accumulator 5605982*. > From our investigation it looks like the accumulator is cleaned by GC first and > the same accumulator is then used for merging the results from the executor on the task > completion event. > As the error java.lang.IllegalAccessError is a LinkageError, which is treated as > a FatalError, the dag-scheduler loop terminates with the below exception. > ---ERROR stack trace -- > Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 5605982 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > I am attaching the thread dump of driver as well -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445999#comment-16445999 ] Artem Rudoy commented on SPARK-22371: - Do we really need to throw an exception from AccumulatorContext.get() when an accumulator is garbage collected? There's a period of time when an accumulator has been garbage collected, but hasn't been removed from AccumulatorContext.originals by ContextCleaner. When an update is received for such accumulator it will throw an exception and kill the whole job. This can happen when a stage completed, but there're still running tasks from other attempts, speculation etc. Since AccumulatorContext.get() returns an Option we could just return None in such case. Before SPARK-20940 this method threw IllegalAccessError which is not a NonFatal, was caught at a lower level and didn't cause job failure. > dag-scheduler-event-loop thread stopped with error Attempted to access > garbage collected accumulator 5605982 > - > > Key: SPARK-22371 > URL: https://issues.apache.org/jira/browse/SPARK-22371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Mayank Agarwal >Priority: Major > Attachments: Helper.scala, ShuffleIssue.java, > driver-thread-dump-spark2.1.txt, sampledata > > > Our Spark jobs are getting stuck on DagScheduler.runJob as the dagscheduler > thread is stopped because of *Attempted to access garbage collected > accumulator 5605982*. > From our investigation it looks like the accumulator is cleaned by GC first and > the same accumulator is then used for merging the results from the executor on the task > completion event. > As the error java.lang.IllegalAccessError is a LinkageError, which is treated as > a FatalError, the dag-scheduler loop terminates with the below exception. 
> ---ERROR stack trace -- > Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 5605982 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > I am attaching the thread dump of driver as well -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
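The behaviour proposed in the comment above can be sketched without Spark. The following is a hypothetical, minimal Python stand-in (the names `AccumulatorRegistry` and `Accumulator` are invented for illustration and do not match Spark's internals): a registry that holds weak references and reports a collected accumulator as absent (`None`) rather than raising a fatal error, so a late update from a speculative or re-attempted task cannot kill the job.

```python
import weakref


class AccumulatorRegistry:
    """Toy stand-in for AccumulatorContext: it holds weak references,
    so registered accumulators can still be garbage collected."""

    def __init__(self):
        self._originals = {}

    def register(self, acc_id, acc):
        self._originals[acc_id] = weakref.ref(acc)

    def get(self, acc_id):
        ref = self._originals.get(acc_id)
        if ref is None:
            return None  # never registered
        # Proposed behaviour: a collected accumulator is simply absent,
        # instead of an IllegalAccessError that kills the scheduler loop.
        return ref()  # None once the accumulator is garbage collected


class Accumulator:
    def __init__(self, value=0):
        self.value = value


registry = AccumulatorRegistry()
acc = Accumulator()
registry.register(5605982, acc)
print(registry.get(5605982) is acc)  # True while the accumulator is alive
del acc  # simulate the accumulator being garbage collected
print(registry.get(5605982))         # None (CPython frees it immediately)
```

A caller that receives `None` can then drop the late update silently, which mirrors the pre-SPARK-20940 behaviour where the error was caught at a lower level and did not fail the job.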
[jira] [Updated] (SPARK-24034) StopIteration in pyspark mapper results in partial results
[ https://issues.apache.org/jira/browse/SPARK-24034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emilio Dorigatti updated SPARK-24034: - Description: Consider the following code {noformat} def mapper(xx): if xx % 2 == 0: raise StopIteration() else: return xx sc.parallelize(range(100)).map(mapper).collect() {noformat} The result I get is `[57, 71, 85]` I think it happens because {{map}} is implemented in terms of {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} raised by the mapper is handled by that iterator. I think this should be raised to the user instead. NB: this may be the underlying cause of https://issues.apache.org/jira/browse/SPARK-23754 was: Consider the following code def mapper(xx): if xx % 2 == 0: raise StopIteration() else: return xx sc.parallelize(range(100)).map(mapper)collect() The result I get is `[57, 71, 85]` I think it happens because `map` is implemented in terms of `mapPartitionsWithIndex` using a custom iterator, so the StopIteration raised by the mapper is handled by that iterator. I think this should be raised to the user instead NB: this may be the underlying cause of https://issues.apache.org/jira/browse/SPARK-23754 > StopIteration in pyspark mapper results in partial results > -- > > Key: SPARK-24034 > URL: https://issues.apache.org/jira/browse/SPARK-24034 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Emilio Dorigatti >Priority: Major > > Consider the following code > {noformat} > def mapper(xx): > if xx % 2 == 0: > raise StopIteration() > else: > return xx > sc.parallelize(range(100)).map(mapper).collect() > {noformat} > The result I get is `[57, 71, 85]` > I think it happens because {{map}} is implemented in terms of > {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} > raised by the mapper is handled by that iterator. 
I think this should be > raised to the user instead > NB: this may be the underlying cause of > https://issues.apache.org/jira/browse/SPARK-23754 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24034) StopIteration in pyspark mapper results in partial results
Emilio Dorigatti created SPARK-24034: Summary: StopIteration in pyspark mapper results in partial results Key: SPARK-24034 URL: https://issues.apache.org/jira/browse/SPARK-24034 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.0 Reporter: Emilio Dorigatti Consider the following code def mapper(xx): if xx % 2 == 0: raise StopIteration() else: return xx sc.parallelize(range(100)).map(mapper).collect() The result I get is `[57, 71, 85]` I think it happens because `map` is implemented in terms of `mapPartitionsWithIndex` using a custom iterator, so the StopIteration raised by the mapper is handled by that iterator. I think this should be raised to the user instead. NB: this may be the underlying cause of https://issues.apache.org/jira/browse/SPARK-23754 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24034) StopIteration in pyspark mapper gives partial results
[ https://issues.apache.org/jira/browse/SPARK-24034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emilio Dorigatti updated SPARK-24034: - Description: Consider the following code {noformat} def mapper(xx): if xx % 2 == 0: raise StopIteration() else: return xx sc.parallelize(range(100)).map(mapper).collect() {noformat} The result I get is {{[57, 71, 85]}} I think it happens because {{map}} is implemented in terms of {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} raised by the mapper is handled by that iterator. I think this should be raised to the user instead. I think I can take care of this, if I am allowed to (first time I contribute, not sure how it works) NB: this may be the underlying cause of https://issues.apache.org/jira/browse/SPARK-23754 was: Consider the following code {noformat} def mapper(xx): if xx % 2 == 0: raise StopIteration() else: return xx sc.parallelize(range(100)).map(mapper).collect() {noformat} The result I get is {{[57, 71, 85]}} I think it happens because {{map }}is implemented in terms of {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} raised by the mapper is handled by that iterator. I think this should be raised to the user instead. 
I think I can take care of this, if I am allowed to (first time I contribute, not sure how it works) NB: this may be the underlying cause of https://issues.apache.org/jira/browse/SPARK-23754 > StopIteration in pyspark mapper gives partial results > - > > Key: SPARK-24034 > URL: https://issues.apache.org/jira/browse/SPARK-24034 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Emilio Dorigatti >Priority: Major > > Consider the following code > {noformat} > def mapper(xx): > if xx % 2 == 0: > raise StopIteration() > else: > return xx > sc.parallelize(range(100)).map(mapper).collect() > {noformat} > The result I get is {{[57, 71, 85]}} > I think it happens because {{map}} is implemented in terms of > {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} > raised by the mapper is handled by that iterator. I think this should be > raised to the user instead. > I think I can take care of this, if I am allowed to (first time I contribute, > not sure how it works) > NB: this may be the underlying cause of > https://issues.apache.org/jira/browse/SPARK-23754 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24034) StopIteration in pyspark mapper results in partial results
[ https://issues.apache.org/jira/browse/SPARK-24034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emilio Dorigatti updated SPARK-24034: - Description: Consider the following code {noformat} def mapper(xx): if xx % 2 == 0: raise StopIteration() else: return xx sc.parallelize(range(100)).map(mapper).collect() {noformat} The result I get is {{[57, 71, 85]}} I think it happens because {{map}} is implemented in terms of {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} raised by the mapper is handled by that iterator. I think this should be raised to the user instead. I think I can take care of this, if I am allowed to (first time I contribute, not sure how it works) NB: this may be the underlying cause of https://issues.apache.org/jira/browse/SPARK-23754 was: Consider the following code {noformat} def mapper(xx): if xx % 2 == 0: raise StopIteration() else: return xx sc.parallelize(range(100)).map(mapper)collect() {noformat} The result I get is `[57, 71, 85]` I think it happens because {{map }}is implemented in terms of {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} raised by the mapper is handled by that iterator. 
I think this should be raised to the user instead NB: this may be the underlying cause of https://issues.apache.org/jira/browse/SPARK-23754 > StopIteration in pyspark mapper results in partial results > -- > > Key: SPARK-24034 > URL: https://issues.apache.org/jira/browse/SPARK-24034 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Emilio Dorigatti >Priority: Major > > Consider the following code > {noformat} > def mapper(xx): > if xx % 2 == 0: > raise StopIteration() > else: > return xx > sc.parallelize(range(100)).map(mapper).collect() > {noformat} > The result I get is {{[57, 71, 85]}} > I think it happens because {{map}} is implemented in terms of > {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} > raised by the mapper is handled by that iterator. I think this should be > raised to the user instead. > I think I can take care of this, if I am allowed to (first time I contribute, > not sure how it works) > NB: this may be the underlying cause of > https://issues.apache.org/jira/browse/SPARK-23754 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24034) StopIteration in pyspark mapper results in partial results
[ https://issues.apache.org/jira/browse/SPARK-24034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emilio Dorigatti updated SPARK-24034: - Description: Consider the following code {noformat} def mapper(xx): if xx % 2 == 0: raise StopIteration() else: return xx sc.parallelize(range(100)).map(mapper).collect() {noformat} The result I get is {{[57, 71, 85]}} I think it happens because {{map}} is implemented in terms of {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} raised by the mapper is handled by that iterator. I think this should be raised to the user instead. I think I can take care of this, if I am allowed to (first time I contribute, not sure how it works) NB: this may be the underlying cause of https://issues.apache.org/jira/browse/SPARK-23754 was: Consider the following code {noformat} def mapper(xx): if xx % 2 == 0: raise StopIteration() else: return xx sc.parallelize(range(100)).map(mapper)collect() {noformat} The result I get is {{[57, 71, 85]}} I think it happens because {{map }}is implemented in terms of {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} raised by the mapper is handled by that iterator. I think this should be raised to the user instead. 
I think I can take care of this, if I am allowed to (first time I contribute, not sure how it works) NB: this may be the underlying cause of https://issues.apache.org/jira/browse/SPARK-23754 > StopIteration in pyspark mapper results in partial results > -- > > Key: SPARK-24034 > URL: https://issues.apache.org/jira/browse/SPARK-24034 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Emilio Dorigatti >Priority: Major > > Consider the following code > {noformat} > def mapper(xx): > if xx % 2 == 0: > raise StopIteration() > else: > return xx > sc.parallelize(range(100)).map(mapper).collect() > {noformat} > The result I get is {{[57, 71, 85]}} > I think it happens because {{map}} is implemented in terms of > {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} > raised by the mapper is handled by that iterator. I think this should be > raised to the user instead. > I think I can take care of this, if I am allowed to (first time I contribute, > not sure how it works) > NB: this may be the underlying cause of > https://issues.apache.org/jira/browse/SPARK-23754 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24034) StopIteration in pyspark mapper gives partial results
[ https://issues.apache.org/jira/browse/SPARK-24034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emilio Dorigatti updated SPARK-24034: - Summary: StopIteration in pyspark mapper gives partial results (was: StopIteration in pyspark mapper results in partial results) > StopIteration in pyspark mapper gives partial results > - > > Key: SPARK-24034 > URL: https://issues.apache.org/jira/browse/SPARK-24034 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Emilio Dorigatti >Priority: Major > > Consider the following code > {noformat} > def mapper(xx): > if xx % 2 == 0: > raise StopIteration() > else: > return xx > sc.parallelize(range(100)).map(mapper).collect() > {noformat} > The result I get is {{[57, 71, 85]}} > I think it happens because {{map}} is implemented in terms of > {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} > raised by the mapper is handled by that iterator. I think this should be > raised to the user instead. > I think I can take care of this, if I am allowed to (first time I contribute, > not sure how it works) > NB: this may be the underlying cause of > https://issues.apache.org/jira/browse/SPARK-23754 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
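The truncation reported in SPARK-24034 can be reproduced in plain Python, without Spark, because the same mechanism is at work: anything consumed through the iterator protocol treats a StopIteration escaping from user code as normal exhaustion. A minimal sketch:

```python
def mapper(x):
    # Buggy user code: raises StopIteration instead of a normal exception.
    if x % 2 == 0:
        raise StopIteration()
    return x

# list() drives map() through the iterator protocol, so the stray
# StopIteration looks like ordinary exhaustion and silently truncates.
truncated = list(map(mapper, [1, 3, 5, 4, 7, 9]))
print(truncated)  # [1, 3, 5]

def gen_mapper(xs):
    # The same mapper driven from inside a generator.
    for x in xs:
        yield mapper(x)

# Inside a generator, PEP 479 (the default since Python 3.7) converts the
# stray StopIteration into a RuntimeError, surfacing the bug to the user.
try:
    list(gen_mapper([1, 3, 4]))
except RuntimeError as exc:
    print("caught:", exc)
```

One plausible remedy, analogous to what PEP 479 does for generators, is for the framework to wrap user functions and re-raise any StopIteration as a different exception rather than letting it truncate the partition.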
[jira] [Updated] (SPARK-24030) SparkSQL percentile_approx function is too slow for over 1,060,000 records.
[ https://issues.apache.org/jira/browse/SPARK-24030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seok-Joon,Yun updated SPARK-24030: -- Attachment: screenshot_2018-04-20 23.15.02.png > SparkSQL percentile_approx function is too slow for over 1,060,000 records. > --- > > Key: SPARK-24030 > URL: https://issues.apache.org/jira/browse/SPARK-24030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 > Environment: Zeppelin + Spark 2.2.1 on Amazon EMR and local laptop. >Reporter: Seok-Joon,Yun >Priority: Major > Attachments: screenshot_2018-04-20 23.15.02.png > > > I used the percentile_approx function for over 1,060,000 records. It is too > slow: it takes about 90 mins. Then I tried it on 1,040,000 records; that takes about > 10 secs. > I tested reading the data via JDBC and from Parquet; both take the same amount of time. > I suspect the function is not designed for multiple workers. > I checked Ganglia and the Spark history server; it ran on only one worker. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23595) Add interpreted execution for ValidateExternalType expression
[ https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-23595. --- Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.4.0 > Add interpreted execution for ValidateExternalType expression > - > > Key: SPARK-23595 > URL: https://issues.apache.org/jira/browse/SPARK-23595 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24031) the method of postTaskEnd should write once in handleTaskCompletion
[ https://issues.apache.org/jira/browse/SPARK-24031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-24031: -- Affects Version/s: (was: 3.0.0) 2.3.0 Priority: Minor (was: Major) There's no real detail here. You'd have to state the problem and solution, and open a PR. Otherwise I'd close this. > the method of postTaskEnd should write once in handleTaskCompletion > --- > > Key: SPARK-24031 > URL: https://issues.apache.org/jira/browse/SPARK-24031 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Yu Wang >Priority: Minor > Attachments: 24031_master_1.patch > > > the method of postTaskEnd should write once in handleTaskCompletion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24033) LAG Window function broken in Spark 2.3
Emlyn Corrin created SPARK-24033: Summary: LAG Window function broken in Spark 2.3 Key: SPARK-24033 URL: https://issues.apache.org/jira/browse/SPARK-24033 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Emlyn Corrin The {{LAG}} window function appears to be broken in Spark 2.3.0, always failing with an AnalysisException. Interestingly, {{LEAD}} is not affected, so it can be worked around by negating the lag and using lead instead. Reproduction (run in {{spark-shell}}): {code:java} import org.apache.spark.sql.expressions.Window val ds = Seq((1,1),(1,2),(1,3),(2,1),(2,2)).toDF("n", "i") // The following works: ds.withColumn("m", lead("i", -1).over(Window.partitionBy("n").orderBy("i").rowsBetween(-1, -1))).show // The following (equivalent) fails: ds.withColumn("m", lag("i", 1).over(Window.partitionBy("n").orderBy("i").rowsBetween(-1, -1))).show {code} Here is the stacktrace: {quote} org.apache.spark.sql.AnalysisException: Window Frame specifiedwindowframe(RowFrame, -1, -1) must match the required frame specifiedwindowframe(RowFrame, -1, -1); at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:2034) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:2030) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:85) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:85) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:120) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:125) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:125) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:85) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:76) at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31.applyOrElse(Analyzer.scala:2030) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31.applyOrElse(Analyzer.scala:2029) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$.apply(Analyzer.scala:2029) at
[jira] [Updated] (SPARK-22823) Race Condition when reading Broadcast shuffle input. Failed to get broadcast piece
[ https://issues.apache.org/jira/browse/SPARK-22823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitrii Bundin updated SPARK-22823: --- Affects Version/s: 2.3.0 > Race Condition when reading Broadcast shuffle input. Failed to get broadcast > piece > -- > > Key: SPARK-22823 > URL: https://issues.apache.org/jira/browse/SPARK-22823 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 2.0.1, 2.2.1, 2.3.0 >Reporter: Dmitrii Bundin >Priority: Major > > It seems we have a race condition when trying to read shuffle input that is > broadcast, not direct. To read broadcast MapStatuses at > {code:java} > org.apache.spark.shuffle.BlockStoreShuffleReader::read() > {code} > we submit a message of the type GetMapOutputStatuses(shuffleId) to be > executed in MapOutputTrackerMaster's pool, which in turn ends up creating a > new broadcast at > {code:java} > org.apache.spark.MapOutputTracker::serializeMapStatuses > {code} > if the serialized statuses are larger than minBroadcastSize. > So registering the newly created broadcast in the driver's > BlockManagerMasterEndpoint may happen after some executor asks the driver for the > broadcast piece's location. 
> In our project we get the following exception on a regular basis: > {code:java} > java.io.IOException: org.apache.spark.SparkException: Failed to get > broadcast_176_piece0 of broadcast_176 > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1280) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:89) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:661) > at > org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:661) > at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54) > at > org.apache.spark.MapOutputTracker$.logInfo(MapOutputTracker.scala:598) > at > org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:660) > at > org.apache.spark.MapOutputTracker.getStatuses(MapOutputTracker.scala:203) > at > org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:142) > at > org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:109) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > 
at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Failed to get > broadcast_176_piece0 of broadcast_176 > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:146) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:125) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:125) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:125) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:186) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1273) > {code} > This exception appears when we try to read a broadcast piece. To do this > we need to fetch the broadcast piece location from the driver > {code:java} > org.apache.spark.storage.BlockManagerMaster::getLocations(blockId: BlockId) > {code} > The driver responds with an empty list of
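The failure path described above can be modeled with a short sketch. This is plain Python and a deliberately simplified stand-in for Spark's TorrentBroadcast / BlockManagerMaster interaction, not the actual implementation: when the driver no longer has any locations registered for a broadcast piece (for example, because the broadcast was destroyed or cleaned up), the read surfaces as the "Failed to get broadcast_176_piece0" exception quoted in the stack trace.

```python
# Toy model of the fetch path -- NOT Spark's real code. An executor asks the
# driver for the locations of a broadcast piece; if the driver returns an
# empty list, there is nowhere to fetch the piece from and the read fails.

class SparkException(Exception):
    pass

def get_locations(driver_registry, block_id):
    """Stand-in for BlockManagerMaster.getLocations(blockId)."""
    return driver_registry.get(block_id, [])

def read_broadcast_piece(driver_registry, block_id):
    locations = get_locations(driver_registry, block_id)
    if not locations:
        # Mirrors "Failed to get broadcast_176_piece0 of broadcast_176"
        raise SparkException(f"Failed to get {block_id}")
    return locations[0]  # in reality: fetch the block from this executor

# Driver responded with an empty location list -> the read fails
registry = {"broadcast_176_piece0": []}
try:
    read_broadcast_piece(registry, "broadcast_176_piece0")
except SparkException as e:
    print(e)  # Failed to get broadcast_176_piece0
```

The sketch only illustrates why an empty location list from the driver is fatal to the read; the retry, caching, and torrent-style piece distribution of the real TorrentBroadcast are omitted.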
[jira] [Resolved] (SPARK-24032) ElasticSearch update fails on nested field
[ https://issues.apache.org/jira/browse/SPARK-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24032. -- Resolution: Invalid I'd not open an issue unless it's clear. Let me leave this resolved for now, but please let me know if new facts come up. > ElasticSearch update fails on nested field > -- > > Key: SPARK-24032 > URL: https://issues.apache.org/jira/browse/SPARK-24032 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Cristina Luengo >Priority: Major > > I'm trying to update a nested field on ElasticSearch using a scripted update. > The code I'm using is: > update_params = "new_samples: samples" > update_script = "ctx._source.samples += new_samples" > es_conf = > { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": > "upsert", "es.update.script.params": update_params, > "es.update.script.inline": update_script } > result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"] > > ).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append') > And the schema of the field is: > |– samples: array (nullable = true)| > | |– element: struct (containsNull = true)| > | | |– gq: integer (nullable = true)| > | | |– dp: integer (nullable = true)| > | | |– gt: string (nullable = true)| > | | |– adBug: array (nullable = true)| > | | | |– element: integer (containsNull = true)| > | | |– ad: double (nullable = true)| > | | |– sample: string (nullable = true)| > And I get the following error: > py4j.protocol.Py4JJavaError: An error occurred while calling o83.save. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 > (TID 1, localhost, executor driver): java.lang.ClassCastException: > scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2 > Am I doing something wrong or is this a bug? Thanks.
[jira] [Commented] (SPARK-24032) ElasticSearch update fails on nested field
[ https://issues.apache.org/jira/browse/SPARK-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445522#comment-16445522 ] Cristina Luengo commented on SPARK-24032: - Thanks for the very fast reply! I wasn't sure whether it was fully an ES problem or a Spark one, but I also escalated the issue to ES. > ElasticSearch update fails on nested field > -- > > Key: SPARK-24032 > URL: https://issues.apache.org/jira/browse/SPARK-24032 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Cristina Luengo >Priority: Major > > I'm trying to update a nested field on ElasticSearch using a scripted update. > The code I'm using is: > update_params = "new_samples: samples" > update_script = "ctx._source.samples += new_samples" > es_conf = > { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": > "upsert", "es.update.script.params": update_params, > "es.update.script.inline": update_script } > result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"] > > ).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append') > And the schema of the field is: > |– samples: array (nullable = true)| > | |– element: struct (containsNull = true)| > | | |– gq: integer (nullable = true)| > | | |– dp: integer (nullable = true)| > | | |– gt: string (nullable = true)| > | | |– adBug: array (nullable = true)| > | | | |– element: integer (containsNull = true)| > | | |– ad: double (nullable = true)| > | | |– sample: string (nullable = true)| > And I get the following error: > py4j.protocol.Py4JJavaError: An error occurred while calling o83.save. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 > (TID 1, localhost, executor driver): java.lang.ClassCastException: > scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2 > Am I doing something wrong or is this a bug? Thanks.
[jira] [Commented] (SPARK-24032) ElasticSearch update fails on nested field
[ https://issues.apache.org/jira/browse/SPARK-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445519#comment-16445519 ] Hyukjin Kwon commented on SPARK-24032: -- Yea, doesn't that look like a problem from the elasticsearch third-party library? I'd resolve this JIRA on the Spark side. > ElasticSearch update fails on nested field > -- > > Key: SPARK-24032 > URL: https://issues.apache.org/jira/browse/SPARK-24032 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Cristina Luengo >Priority: Major > > I'm trying to update a nested field on ElasticSearch using a scripted update. > The code I'm using is: > update_params = "new_samples: samples" > update_script = "ctx._source.samples += new_samples" > es_conf = > { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": > "upsert", "es.update.script.params": update_params, > "es.update.script.inline": update_script } > result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"] > > ).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append') > And the schema of the field is: > |– samples: array (nullable = true)| > | |– element: struct (containsNull = true)| > | | |– gq: integer (nullable = true)| > | | |– dp: integer (nullable = true)| > | | |– gt: string (nullable = true)| > | | |– adBug: array (nullable = true)| > | | | |– element: integer (containsNull = true)| > | | |– ad: double (nullable = true)| > | | |– sample: string (nullable = true)| > And I get the following error: > py4j.protocol.Py4JJavaError: An error occurred while calling o83.save. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 > (TID 1, localhost, executor driver): java.lang.ClassCastException: > scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2 > Am I doing something wrong or is this a bug? Thanks.
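As a loose analogy (plain Python, not the elasticsearch-hadoop code; the field names are illustrative), the ClassCastException quoted above amounts to a writer that assumes every incoming value can be treated as a two-element (key, value) pair, while an array-of-structs column arrives as a plain sequence:

```python
# Loose analogy only -- mirrors the shape of the failure, not the real
# DataFrameValueWriter. The writer insists on a tuple (the Scala side casts
# to scala.Tuple2); an array-of-structs value arrives as a list, so the
# "cast" fails before anything is written to the index.

def write_field(value):
    if not isinstance(value, tuple):  # stand-in for the cast to Tuple2
        raise TypeError(f"{type(value).__name__} cannot be cast to tuple")
    key, val = value
    return f"{key}={val}"

print(write_field(("gq", 30)))  # a plain pair serializes fine: gq=30

samples = [{"gq": 30, "dp": 10}, {"gq": 25, "dp": 12}]  # array of structs
try:
    write_field(samples)
except TypeError as e:
    print(e)  # list cannot be cast to tuple
```

The point of the analogy is only that the mismatch is between the value's runtime shape and the shape the writer expects, which is why it fails inside the connector's serialization path rather than in Spark proper.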
[jira] [Commented] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side
[ https://issues.apache.org/jira/browse/SPARK-24025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445520#comment-16445520 ] Jacek Laskowski commented on SPARK-24025: - it seems related or duplicated > Join of bucketed and non-bucketed tables can give two exchanges and sorts for > non-bucketed side > --- > > Key: SPARK-24025 > URL: https://issues.apache.org/jira/browse/SPARK-24025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: {code:java} > ./bin/spark-shell --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.3.0 > /_/ > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162 > Branch master > Compiled by user sameera on 2018-02-22T19:24:29Z > Revision a0d7949896e70f427e7f3942ff340c9484ff0aab > Url g...@github.com:sameeragarwal/spark.git > Type --help for more information.{code} >Reporter: Jacek Laskowski >Priority: Major > Attachments: join-jira.png > > > While exploring bucketing I found the following join query of non-bucketed > and bucketed tables that ends up with two exchanges and two sorts in the > physical plan for the non-bucketed join side. 
> {code} > // Make sure that you don't end up with a BroadcastHashJoin and a > BroadcastExchange > // Disable auto broadcasting > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) > val bucketedTableName = "bucketed_4_id" > val large = spark.range(100) > large.write > .bucketBy(4, "id") > .sortBy("id") > .mode("overwrite") > .saveAsTable(bucketedTableName) > // Describe the table and include bucketing spec only > val descSQL = sql(s"DESC FORMATTED > $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === > "Sort Columns") > scala> descSQL.show(truncate = false) > +--+-+---+ > |col_name |data_type|comment| > +--+-+---+ > |Num Buckets |4| | > |Bucket Columns|[`id`] | | > |Sort Columns |[`id`] | | > +--+-+---+ > val bucketedTable = spark.table(bucketedTableName) > val t1 = spark.range(4) > .repartition(2, $"id") // Use just 2 partitions > .sortWithinPartitions("id") // sort partitions > val q = t1.join(bucketedTable, "id") > // Note two exchanges and sorts > scala> q.explain > == Physical Plan == > *(5) Project [id#79L] > +- *(5) SortMergeJoin [id#79L], [id#77L], Inner >:- *(3) Sort [id#79L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#79L, 4) >: +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0 >:+- Exchange hashpartitioning(id#79L, 2) >: +- *(1) Range (0, 4, step=1, splits=8) >+- *(4) Sort [id#77L ASC NULLS FIRST], false, 0 > +- *(4) Project [id#77L] > +- *(4) Filter isnotnull(id#77L) > +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct > q.foreach(_ => ()) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24032) ElasticSearch update fails on nested field
[ https://issues.apache.org/jira/browse/SPARK-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24032: - Priority: Major (was: Critical) > ElasticSearch update fails on nested field > -- > > Key: SPARK-24032 > URL: https://issues.apache.org/jira/browse/SPARK-24032 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Cristina Luengo >Priority: Major > > I'm trying to update a nested field on ElasticSearch using a scripted update. > The code I'm using is: > update_params = "new_samples: samples" > update_script = "ctx._source.samples += new_samples" > es_conf = > { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": > "upsert", "es.update.script.params": update_params, > "es.update.script.inline": update_script } > result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"] > > ).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append') > And the schema of the field is: > |– samples: array (nullable = true)| > | |– element: struct (containsNull = true)| > | | |– gq: integer (nullable = true)| > | | |– dp: integer (nullable = true)| > | | |– gt: string (nullable = true)| > | | |– adBug: array (nullable = true)| > | | | |– element: integer (containsNull = true)| > | | |– ad: double (nullable = true)| > | | |– sample: string (nullable = true)| > And I get the following error: > py4j.protocol.Py4JJavaError: An error occurred while calling o83.save. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 > (TID 1, localhost, executor driver): java.lang.ClassCastException: > scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2 > Am I doing something wrong or is this a bug? Thanks. 
[jira] [Commented] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side
[ https://issues.apache.org/jira/browse/SPARK-24025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445516#comment-16445516 ] Hyukjin Kwon commented on SPARK-24025: -- I haven't taken a close look yet but it should be good to just link it with "relates to" with a comment that it seems related or duplicated, for now. Further actions like investigation, testing or fixing would be even better, of course. We can change the link when better facts arrive :-). > Join of bucketed and non-bucketed tables can give two exchanges and sorts for > non-bucketed side > --- > > Key: SPARK-24025 > URL: https://issues.apache.org/jira/browse/SPARK-24025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: {code:java} > ./bin/spark-shell --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.3.0 > /_/ > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162 > Branch master > Compiled by user sameera on 2018-02-22T19:24:29Z > Revision a0d7949896e70f427e7f3942ff340c9484ff0aab > Url g...@github.com:sameeragarwal/spark.git > Type --help for more information.{code} >Reporter: Jacek Laskowski >Priority: Major > Attachments: join-jira.png > > > While exploring bucketing I found the following join query of non-bucketed > and bucketed tables that ends up with two exchanges and two sorts in the > physical plan for the non-bucketed join side. 
> {code} > // Make sure that you don't end up with a BroadcastHashJoin and a > BroadcastExchange > // Disable auto broadcasting > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) > val bucketedTableName = "bucketed_4_id" > val large = spark.range(100) > large.write > .bucketBy(4, "id") > .sortBy("id") > .mode("overwrite") > .saveAsTable(bucketedTableName) > // Describe the table and include bucketing spec only > val descSQL = sql(s"DESC FORMATTED > $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === > "Sort Columns") > scala> descSQL.show(truncate = false) > +--+-+---+ > |col_name |data_type|comment| > +--+-+---+ > |Num Buckets |4| | > |Bucket Columns|[`id`] | | > |Sort Columns |[`id`] | | > +--+-+---+ > val bucketedTable = spark.table(bucketedTableName) > val t1 = spark.range(4) > .repartition(2, $"id") // Use just 2 partitions > .sortWithinPartitions("id") // sort partitions > val q = t1.join(bucketedTable, "id") > // Note two exchanges and sorts > scala> q.explain > == Physical Plan == > *(5) Project [id#79L] > +- *(5) SortMergeJoin [id#79L], [id#77L], Inner >:- *(3) Sort [id#79L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#79L, 4) >: +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0 >:+- Exchange hashpartitioning(id#79L, 2) >: +- *(1) Range (0, 4, step=1, splits=8) >+- *(4) Sort [id#77L ASC NULLS FIRST], false, 0 > +- *(4) Project [id#77L] > +- *(4) Filter isnotnull(id#77L) > +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct > q.foreach(_ => ()) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24032) ElasticSearch update fails on nested field
[ https://issues.apache.org/jira/browse/SPARK-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445514#comment-16445514 ] Cristina Luengo commented on SPARK-24032: - Ok thanks!! Sorry, I didn't really know what to assign it! The full stack trace is: Traceback (most recent call last): File "/home/cluengo/Documents/Projects/vcfLoader/python/rdconnect/main.py", line 151, in main(hc,sqlContext) File "/home/cluengo/Documents/Projects/vcfLoader/python/rdconnect/main.py", line 141, in main variantsRN.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"] ).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append') File "/home/cluengo/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 595, in save File "/home/cluengo/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/home/cluengo/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/home/cluengo/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o181.save. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 41.0 failed 1 times, most recent failure: Lost task 0.0 in stage 41.0 (TID 200, localhost, executor driver): java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2 at org.elasticsearch.spark.sql.DataFrameValueWriter.write(DataFrameValueWriter.scala:53) at org.elasticsearch.hadoop.serialization.bulk.AbstractBulkFactory$FieldWriter.doWrite(AbstractBulkFactory.java:152) at org.elasticsearch.hadoop.serialization.bulk.AbstractBulkFactory$FieldWriter.write(AbstractBulkFactory.java:118) at org.elasticsearch.hadoop.serialization.bulk.TemplatedBulk.writeTemplate(TemplatedBulk.java:80) at org.elasticsearch.hadoop.serialization.bulk.TemplatedBulk.write(TemplatedBulk.java:56) at org.elasticsearch.hadoop.rest.RestRepository.writeToIndex(RestRepository.java:168) at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:67) at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:101) at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:101) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2075) at org.elasticsearch.spark.sql.EsSparkSQL$.saveToEs(EsSparkSQL.scala:101) at org.elasticsearch.spark.sql.ElasticsearchRelation.insert(DefaultSource.scala:610) at org.elasticsearch.spark.sql.DefaultSource.createRelation(DefaultSource.scala:103) at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472) at
[jira] [Commented] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side
[ https://issues.apache.org/jira/browse/SPARK-24025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445512#comment-16445512 ] Jacek Laskowski commented on SPARK-24025: - I was about to close this as a duplicate, but I'm not so sure anymore. Reading the following in SPARK-17570: {quote}If the number of buckets in the output table is a factor of the buckets in the input table, we should be able to avoid `Sort` and `Exchange` and directly join those. {quote} I'm no longer so sure that the two issues are perfectly identical, but they're close "relatives". I think the next step would be to write a test case to reproduce this issue, and the fairly simple fix would be a physical optimization (rule) that would eliminate the extra Exchange and Sort ops. Who'd help me here? > Join of bucketed and non-bucketed tables can give two exchanges and sorts for > non-bucketed side > --- > > Key: SPARK-24025 > URL: https://issues.apache.org/jira/browse/SPARK-24025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: {code:java} > ./bin/spark-shell --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.3.0 > /_/ > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162 > Branch master > Compiled by user sameera on 2018-02-22T19:24:29Z > Revision a0d7949896e70f427e7f3942ff340c9484ff0aab > Url g...@github.com:sameeragarwal/spark.git > Type --help for more information.{code} >Reporter: Jacek Laskowski >Priority: Major > Attachments: join-jira.png > > > While exploring bucketing I found the following join query of non-bucketed > and bucketed tables that ends up with two exchanges and two sorts in the > physical plan for the non-bucketed join side. 
> {code} > // Make sure that you don't end up with a BroadcastHashJoin and a > BroadcastExchange > // Disable auto broadcasting > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) > val bucketedTableName = "bucketed_4_id" > val large = spark.range(100) > large.write > .bucketBy(4, "id") > .sortBy("id") > .mode("overwrite") > .saveAsTable(bucketedTableName) > // Describe the table and include bucketing spec only > val descSQL = sql(s"DESC FORMATTED > $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === > "Sort Columns") > scala> descSQL.show(truncate = false) > +--+-+---+ > |col_name |data_type|comment| > +--+-+---+ > |Num Buckets |4| | > |Bucket Columns|[`id`] | | > |Sort Columns |[`id`] | | > +--+-+---+ > val bucketedTable = spark.table(bucketedTableName) > val t1 = spark.range(4) > .repartition(2, $"id") // Use just 2 partitions > .sortWithinPartitions("id") // sort partitions > val q = t1.join(bucketedTable, "id") > // Note two exchanges and sorts > scala> q.explain > == Physical Plan == > *(5) Project [id#79L] > +- *(5) SortMergeJoin [id#79L], [id#77L], Inner >:- *(3) Sort [id#79L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#79L, 4) >: +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0 >:+- Exchange hashpartitioning(id#79L, 2) >: +- *(1) Range (0, 4, step=1, splits=8) >+- *(4) Sort [id#77L ASC NULLS FIRST], false, 0 > +- *(4) Project [id#77L] > +- *(4) Filter isnotnull(id#77L) > +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct > q.foreach(_ => ()) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
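The condition quoted from SPARK-17570 can be sketched as a tiny predicate. This is a hypothetical helper, not part of Spark's API: the extra exchange/sort pair on one side should in principle be avoidable when that side's partition count is a factor of the other side's bucket count, because every coarser bucket is then a union of whole finer buckets.

```python
# Sketch of the SPARK-17570 condition -- a hypothetical helper, not a Spark
# API. When one bucket/partition count is a factor of the other, rows that
# hash into a coarse bucket come from whole fine buckets, so no reshuffle
# is needed to align the two sides of the join.

def can_skip_exchange(side_partitions: int, other_buckets: int) -> bool:
    """True if side_partitions is a factor of other_buckets."""
    return side_partitions > 0 and other_buckets % side_partitions == 0

# In the plan above the bucketed side has 4 buckets and the repartitioned
# side has 2 partitions: 2 is a factor of 4, so one Exchange is redundant.
print(can_skip_exchange(2, 4))  # True
print(can_skip_exchange(3, 4))  # False
```

Whether Spark's planner actually exploits this for the plan above is exactly the open question in this thread; the predicate only states when such an optimization is possible in principle.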
[jira] [Commented] (SPARK-24032) ElasticSearch update fails on nested field
[ https://issues.apache.org/jira/browse/SPARK-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445510#comment-16445510 ] Hyukjin Kwon commented on SPARK-24032: -- BTW, please avoid setting Critical+, which is usually reserved for committers. > ElasticSearch update fails on nested field > -- > > Key: SPARK-24032 > URL: https://issues.apache.org/jira/browse/SPARK-24032 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Cristina Luengo >Priority: Major > > I'm trying to update a nested field on ElasticSearch using a scripted update. > The code I'm using is: > update_params = "new_samples: samples" > update_script = "ctx._source.samples += new_samples" > es_conf = > { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": > "upsert", "es.update.script.params": update_params, > "es.update.script.inline": update_script } > result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"] > > ).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append') > And the schema of the field is: > |– samples: array (nullable = true)| > | |– element: struct (containsNull = true)| > | | |– gq: integer (nullable = true)| > | | |– dp: integer (nullable = true)| > | | |– gt: string (nullable = true)| > | | |– adBug: array (nullable = true)| > | | | |– element: integer (containsNull = true)| > | | |– ad: double (nullable = true)| > | | |– sample: string (nullable = true)| > And I get the following error: > py4j.protocol.Py4JJavaError: An error occurred while calling o83.save. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 > (TID 1, localhost, executor driver): java.lang.ClassCastException: > scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2 > Am I doing something wrong or is this a bug? Thanks.
[jira] [Commented] (SPARK-24032) ElasticSearch update fails on nested field
[ https://issues.apache.org/jira/browse/SPARK-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445507#comment-16445507 ] Hyukjin Kwon commented on SPARK-24032: -- Can you post the full stack trace? I doubt if it's a Spark issue. > ElasticSearch update fails on nested field > -- > > Key: SPARK-24032 > URL: https://issues.apache.org/jira/browse/SPARK-24032 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Cristina Luengo >Priority: Critical > > I'm trying to update a nested field on ElasticSearch using a scripted update. > The code I'm using is: > update_params = "new_samples: samples" > update_script = "ctx._source.samples += new_samples" > es_conf = > { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": > "upsert", "es.update.script.params": update_params, > "es.update.script.inline": update_script } > result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"] > > ).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append') > And the schema of the field is: > |– samples: array (nullable = true)| > | |– element: struct (containsNull = true)| > | | |– gq: integer (nullable = true)| > | | |– dp: integer (nullable = true)| > | | |– gt: string (nullable = true)| > | | |– adBug: array (nullable = true)| > | | | |– element: integer (containsNull = true)| > | | |– ad: double (nullable = true)| > | | |– sample: string (nullable = true)| > And I get the following error: > py4j.protocol.Py4JJavaError: An error occurred while calling o83.save. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 > (TID 1, localhost, executor driver): java.lang.ClassCastException: > scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2 > Am I doing something wrong or is this a bug? Thanks.
[jira] [Commented] (SPARK-16854) mapWithState Support for Python
[ https://issues.apache.org/jira/browse/SPARK-16854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445503#comment-16445503 ] Fatma Ali commented on SPARK-16854: --- +1 > mapWithState Support for Python > --- > > Key: SPARK-16854 > URL: https://issues.apache.org/jira/browse/SPARK-16854 > Project: Spark > Issue Type: Task > Components: PySpark >Affects Versions: 1.6.2, 2.0.0 >Reporter: Boaz >Priority: Major >
[jira] [Updated] (SPARK-24032) ElasticSearch update fails on nested field
[ https://issues.apache.org/jira/browse/SPARK-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cristina Luengo updated SPARK-24032:
Description:
I'm trying to update a nested field on ElasticSearch using a scripted update. The code I'm using is:
update_params = "new_samples: samples"
update_script = "ctx._source.samples += new_samples"
es_conf = { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": "upsert", "es.update.script.params": update_params, "es.update.script.inline": update_script }
result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"]).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append')
And the schema of the field is:
|-- samples: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- gq: integer (nullable = true)
|    |    |-- dp: integer (nullable = true)
|    |    |-- gt: string (nullable = true)
|    |    |-- adBug: array (nullable = true)
|    |    |    |-- element: integer (containsNull = true)
|    |    |-- ad: double (nullable = true)
|    |    |-- sample: string (nullable = true)
And I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o83.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2
Am I doing something wrong or is this a bug? Thanks.

was:
I'm trying to update a nested field using Spark and a scripted update. The code I'm using is:
update_params = "new_samples: samples"
update_script = "ctx._source.samples += new_samples"
es_conf = { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": "upsert", "es.update.script.params": update_params, "es.update.script.inline": update_script }
result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"]).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append')
And the schema of the field is:
|-- samples: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- gq: integer (nullable = true)
|    |    |-- dp: integer (nullable = true)
|    |    |-- gt: string (nullable = true)
|    |    |-- adBug: array (nullable = true)
|    |    |    |-- element: integer (containsNull = true)
|    |    |-- ad: double (nullable = true)
|    |    |-- sample: string (nullable = true)
And I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o83.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2
Am I doing something wrong or is this a bug? Thanks.

> ElasticSearch update fails on nested field
> --
>
> Key: SPARK-24032
> URL: https://issues.apache.org/jira/browse/SPARK-24032
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Cristina Luengo
>Priority: Critical
>
> I'm trying to update a nested field on ElasticSearch using a scripted update.
> The code I'm using is:
> update_params = "new_samples: samples"
> update_script = "ctx._source.samples += new_samples"
> es_conf = { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": "upsert", "es.update.script.params": update_params, "es.update.script.inline": update_script }
> result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"]).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append')
> And the schema of the field is:
> |-- samples: array (nullable = true)
> |    |-- element: struct (containsNull = true)
> |    |    |-- gq: integer (nullable = true)
> |    |    |-- dp: integer (nullable = true)
> |    |    |-- gt: string (nullable = true)
> |    |    |-- adBug: array (nullable = true)
> |    |    |    |-- element: integer (containsNull = true)
> |    |    |-- ad: double (nullable = true)
> |    |    |-- sample: string (nullable = true)
> And I get the following error:
> py4j.protocol.Py4JJavaError: An error occurred while calling o83.save.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0
[jira] [Created] (SPARK-24032) ElasticSearch updata fails on nested field
Cristina Luengo created SPARK-24032: --- Summary: ElasticSearch updata fails on nested field Key: SPARK-24032 URL: https://issues.apache.org/jira/browse/SPARK-24032 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Cristina Luengo I'm trying to update a nested field using Spark and a scripted update. The code I'm using is: update_params = "new_samples: samples" update_script = "ctx._source.samples += new_samples" es_conf = { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": "upsert", "es.update.script.params": update_params, "es.update.script.inline": update_script } result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"] ).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append') And the schema of the field is: |-- samples: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- gq: integer (nullable = true) | | |-- dp: integer (nullable = true) | | |-- gt: string (nullable = true) | | |-- adBug: array (nullable = true) | | | |-- element: integer (containsNull = true) | | |-- ad: double (nullable = true) | | |-- sample: string (nullable = true) And I get the following error: py4j.protocol.Py4JJavaError: An error occurred while calling o83.save. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2 Am I doing something wrong or is this a bug? Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24032) ElasticSearch update fails on nested field
[ https://issues.apache.org/jira/browse/SPARK-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cristina Luengo updated SPARK-24032: Summary: ElasticSearch update fails on nested field (was: ElasticSearch updata fails on nested field) > ElasticSearch update fails on nested field > -- > > Key: SPARK-24032 > URL: https://issues.apache.org/jira/browse/SPARK-24032 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Cristina Luengo >Priority: Critical > > I'm trying to update a nested field using Spark and a scripted update. The > code I'm using is: > update_params = "new_samples: samples" > update_script = "ctx._source.samples += new_samples" > es_conf = { > "es.mapping.id": "id", > "es.mapping.exclude": "id", > "es.write.operation": "upsert", > "es.update.script.params": update_params, > "es.update.script.inline": update_script > } > result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"] > > ).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append') > And the schema of the field is: > |-- samples: array (nullable = true) > | |-- element: struct (containsNull = true) > | | |-- gq: integer (nullable = true) > | | |-- dp: integer (nullable = true) > | | |-- gt: string (nullable = true) > | | |-- adBug: array (nullable = true) > | | | |-- element: integer (containsNull = true) > | | |-- ad: double (nullable = true) > | | |-- sample: string (nullable = true) > And I get the following error: > py4j.protocol.Py4JJavaError: An error occurred while calling o83.save. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 > (TID 1, localhost, executor driver): java.lang.ClassCastException: > scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2 > Am I doing something wrong or is this a bug? Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
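For readers following the SPARK-24032 report above, the write configuration it describes boils down to a plain Python dict of elasticsearch-hadoop options. This is a sketch assembled from the reporter's own snippet (the `es.*` keys, the script strings, and the `result` DataFrame are the reporter's; the actual `.write` call, shown commented out, would need a SparkSession and the elasticsearch-hadoop connector on the classpath):

```python
# Scripted-upsert options from the report, as a self-contained dict.
# "es.update.script.params" binds a script variable to a document field;
# "es.update.script.inline" is the update script the cluster executes.
update_params = "new_samples: samples"
update_script = "ctx._source.samples += new_samples"

es_conf = {
    "es.mapping.id": "id",        # use the "id" column as the document _id
    "es.mapping.exclude": "id",   # but do not duplicate it inside _source
    "es.write.operation": "upsert",
    "es.update.script.params": update_params,
    "es.update.script.inline": update_script,
}

# With a live cluster, the reporter's write would then be roughly:
# (result.write.format("org.elasticsearch.spark.sql")
#        .options(**es_conf)
#        .option("es.nodes", host)
#        .option("es.port", port)
#        .save(index_name, mode="append"))
```

The reported `ClassCastException` on `WrappedArray$ofRef` fires only when the field bound by `es.update.script.params` is an array-of-struct column, which is what makes this look like a connector serialization bug rather than a misconfiguration.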
[jira] [Commented] (SPARK-13136) Data exchange (shuffle, broadcast) should only be handled by the exchange operator
[ https://issues.apache.org/jira/browse/SPARK-13136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445428#comment-16445428 ] Apache Spark commented on SPARK-13136:
-- User 'seancxmao' has created a pull request for this issue: https://github.com/apache/spark/pull/21113
> Data exchange (shuffle, broadcast) should only be handled by the exchange operator
> --
>
> Key: SPARK-13136
> URL: https://issues.apache.org/jira/browse/SPARK-13136
> Project: Spark
> Issue Type: Improvement
> Components: SQL
>Reporter: Reynold Xin
>Assignee: Herman van Hovell
>Priority: Major
> Fix For: 2.0.0
>
> In an ideal architecture, we have a very small number of physical operators that handle data exchanges, and the rest simply declare the input data distribution needed and let the planner inject the right Exchange operators. We have almost that, except for the following few operators:
> 1. Limit: does its own shuffle or collect to get data to a single partition.
> 2. Except: does its own shuffle; note that this operator is going away and will be replaced by anti-join (SPARK-12660).
> 3. Broadcast joins: broadcast joins do their own broadcast, which is a form of data exchange.
> Here is a straw man for limit. Split the current Limit operator into two: a partition-local limit and a terminal limit. Partition-local limit is just a normal unary operator. The terminal limit requires the input data distribution to be a single partition, and then takes its own limit. We then update the planner (strategies) to turn a logical limit into a partition-local limit and a terminal limit.
> For broadcast join, it is more involved. We would need to design the interface for the physical operators (e.g. we are no longer taking an iterator as input on the probe side), and allow Exchange to handle data broadcast.
> Note that this is an important step towards creating a clear delineation between distributed query execution and single-threaded query execution.
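The limit straw man quoted in SPARK-13136 above can be sketched outside Spark: first cap each partition independently, then bring everything to one place and cap once more. A toy Python illustration of the two-phase idea (not Spark code; function names are illustrative only):

```python
from itertools import islice

def partition_local_limit(partitions, n):
    # Phase 1: each partition independently keeps at most n rows.
    # In Spark this is a normal unary operator with no exchange.
    return [list(islice(iter(p), n)) for p in partitions]

def terminal_limit(partitions, n):
    # Phase 2: requires a single-partition distribution (here, a flat
    # concatenation stands in for the Exchange), then limits once more.
    merged = [row for p in partitions for row in p]
    return merged[:n]

partitions = [[1, 2, 3, 4], [5, 6], [7, 8, 9]]
locally_limited = partition_local_limit(partitions, 2)  # at most 2 rows per partition
result = terminal_limit(locally_limited, 2)             # final global limit
```

The point of the split is that phase 1 shrinks each partition before any data movement, so the single-partition exchange in phase 2 only ever ships at most `n * num_partitions` rows.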
[jira] [Commented] (SPARK-23588) Add interpreted execution for CatalystToExternalMap expression
[ https://issues.apache.org/jira/browse/SPARK-23588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445382#comment-16445382 ] Apache Spark commented on SPARK-23588: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/21112 > Add interpreted execution for CatalystToExternalMap expression > -- > > Key: SPARK-23588 > URL: https://issues.apache.org/jira/browse/SPARK-23588 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24031) the method of postTaskEnd should write once in handleTaskCompletion
[ https://issues.apache.org/jira/browse/SPARK-24031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445381#comment-16445381 ] Yu Wang commented on SPARK-24031: - @[~srowen] [~joshrosen] please have a look. > the method of postTaskEnd should write once in handleTaskCompletion > --- > > Key: SPARK-24031 > URL: https://issues.apache.org/jira/browse/SPARK-24031 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yu Wang >Priority: Major > Attachments: 24031_master_1.patch > > > the method of postTaskEnd should write once in handleTaskCompletion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-24031) the method of postTaskEnd should write once in handleTaskCompletion
[ https://issues.apache.org/jira/browse/SPARK-24031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Wang reopened SPARK-24031: - > the method of postTaskEnd should write once in handleTaskCompletion > --- > > Key: SPARK-24031 > URL: https://issues.apache.org/jira/browse/SPARK-24031 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yu Wang >Priority: Major > Attachments: 24031_master_1.patch > > > the method of postTaskEnd should write once in handleTaskCompletion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24031) the method of postTaskEnd should write once in handleTaskCompletion
[ https://issues.apache.org/jira/browse/SPARK-24031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Wang resolved SPARK-24031. - Resolution: Fixed > the method of postTaskEnd should write once in handleTaskCompletion > --- > > Key: SPARK-24031 > URL: https://issues.apache.org/jira/browse/SPARK-24031 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yu Wang >Priority: Major > Attachments: 24031_master_1.patch > > > the method of postTaskEnd should write once in handleTaskCompletion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24031) the method of postTaskEnd should write once in handleTaskCompletion
[ https://issues.apache.org/jira/browse/SPARK-24031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Wang updated SPARK-24031: Target Version/s: (was: 3.0.0) Fix Version/s: (was: 3.0.0) > the method of postTaskEnd should write once in handleTaskCompletion > --- > > Key: SPARK-24031 > URL: https://issues.apache.org/jira/browse/SPARK-24031 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yu Wang >Priority: Major > Attachments: 24031_master_1.patch > > > the method of postTaskEnd should write once in handleTaskCompletion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-23705) dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
[ https://issues.apache.org/jira/browse/SPARK-23705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Wang updated SPARK-23705: Comment: was deleted (was: [~khoatrantan2000] Could you assign this patch to me?) > dataframe.groupBy() may inadvertently receive sequence of non-distinct strings > -- > > Key: SPARK-23705 > URL: https://issues.apache.org/jira/browse/SPARK-23705 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Khoa Tran >Priority: Minor > Labels: beginner, easyfix, features, newbie, starter > Original Estimate: 1h > Remaining Estimate: 1h > > {code:java} > // code placeholder > package org.apache.spark.sql > . > . > . > class Dataset[T] private[sql]( > . > . > . > def groupBy(col1: String, cols: String*): RelationalGroupedDataset = { > val colNames: Seq[String] = col1 +: cols > RelationalGroupedDataset( > toDF(), colNames.map(colName => resolve(colName)), > RelationalGroupedDataset.GroupByType) > } > {code} > should append a `.distinct` after `colNames` when used in `groupBy` > > Not sure if the community agrees with this or it's up to the users to perform > the distinct operation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23705) dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
[ https://issues.apache.org/jira/browse/SPARK-23705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Wang updated SPARK-23705: Attachment: (was: SPARK-23705.patch) > dataframe.groupBy() may inadvertently receive sequence of non-distinct strings > -- > > Key: SPARK-23705 > URL: https://issues.apache.org/jira/browse/SPARK-23705 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Khoa Tran >Priority: Minor > Labels: beginner, easyfix, features, newbie, starter > Original Estimate: 1h > Remaining Estimate: 1h > > {code:java} > // code placeholder > package org.apache.spark.sql > . > . > . > class Dataset[T] private[sql]( > . > . > . > def groupBy(col1: String, cols: String*): RelationalGroupedDataset = { > val colNames: Seq[String] = col1 +: cols > RelationalGroupedDataset( > toDF(), colNames.map(colName => resolve(colName)), > RelationalGroupedDataset.GroupByType) > } > {code} > should append a `.distinct` after `colNames` when used in `groupBy` > > Not sure if the community agrees with this or it's up to the users to perform > the distinct operation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
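The fix proposed in the SPARK-23705 snippet quoted above, appending `.distinct` to `colNames`, amounts to an order-preserving de-duplication of the grouping column names (Scala's `Seq.distinct` keeps the first occurrence of each element). A Python sketch of that behavior, mirroring the `col1 +: cols` signature (illustrative only, not the actual Scala patch):

```python
def dedup_group_cols(col1, *cols):
    # Equivalent of `(col1 +: cols).distinct` in the quoted Scala:
    # keep the first occurrence of each column name, preserving order.
    seen = set()
    out = []
    for name in (col1, *cols):
        if name not in seen:
            seen.add(name)
            out.append(name)
    return out

# Duplicate names collapse; first-occurrence order is preserved.
print(dedup_group_cols("a", "b", "a", "c", "b"))  # -> ['a', 'b', 'c']
```

Whether this belongs in `groupBy` itself or stays the caller's responsibility is exactly the open question the reporter raises; grouping by a repeated column name is redundant but not incorrect.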
[jira] [Updated] (SPARK-24031) the method of postTaskEnd should write once in handleTaskCompletion
[ https://issues.apache.org/jira/browse/SPARK-24031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Wang updated SPARK-24031: Attachment: 24031_master_1.patch > the method of postTaskEnd should write once in handleTaskCompletion > --- > > Key: SPARK-24031 > URL: https://issues.apache.org/jira/browse/SPARK-24031 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yu Wang >Priority: Major > Fix For: 3.0.0 > > Attachments: 24031_master_1.patch > > > the method of postTaskEnd should write once in handleTaskCompletion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24031) the method of postTaskEnd should write once in handleTaskCompletion
Yu Wang created SPARK-24031: --- Summary: the method of postTaskEnd should write once in handleTaskCompletion Key: SPARK-24031 URL: https://issues.apache.org/jira/browse/SPARK-24031 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Yu Wang Fix For: 3.0.0 the method of postTaskEnd should write once in handleTaskCompletion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org