[jira] [Assigned] (SPARK-19826) spark.ml Python API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19826:
------------------------------------
    Assignee: (was: Apache Spark)

> spark.ml Python API for PIC
>
> Key: SPARK-19826
> URL: https://issues.apache.org/jira/browse/SPARK-19826
> Project: Spark
> Issue Type: Sub-task
> Components: ML, PySpark
> Affects Versions: 2.1.0
> Reporter: Felix Cheung
> Priority: Major

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19826) spark.ml Python API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19826:
------------------------------------
    Assignee: Apache Spark

> spark.ml Python API for PIC
[jira] [Commented] (SPARK-19826) spark.ml Python API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446626#comment-16446626 ]

Apache Spark commented on SPARK-19826:
--------------------------------------
User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/21119

> spark.ml Python API for PIC
[jira] [Updated] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is
[ https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-17916:
----------------------------
    Target Version/s: 2.4.0

> CSV data source treats empty string as null no matter what nullValue option is
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1
> Reporter: Hossein Falaki
> Priority: Major
>
> When a user configures {{nullValue}} in the CSV data source, in addition to those values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.
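The reported behavior can be mimicked outside Spark. The sketch below is a plain-Python illustration (stdlib `csv` module, not Spark's reader) of the difference between the current behavior (empty strings also become null) and the behavior the reporter expects; the `parse_csv` helper and its `empty_as_null` flag are invented for illustration:

```python
import csv
import io

def parse_csv(text, null_value, empty_as_null):
    """Parse CSV text, mapping null_value (and optionally "") to None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        cleaned = {}
        for key, value in row.items():
            if value == null_value or (empty_as_null and value == ""):
                cleaned[key] = None
            else:
                cleaned[key] = value
        rows.append(cleaned)
    return rows

data = 'col1,col2\n1,"-"\n2,""\n'

# Behavior described in the ticket: both "-" and "" become null.
spark_like = parse_csv(data, null_value="-", empty_as_null=True)
# Behavior the reporter expects: only "-" becomes null; "" survives.
expected = parse_csv(data, null_value="-", empty_as_null=False)
```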
[jira] [Assigned] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.
[ https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23325:
------------------------------------
    Assignee: Apache Spark

> DataSourceV2 readers should always produce InternalRow.
>
> Key: SPARK-23325
> URL: https://issues.apache.org/jira/browse/SPARK-23325
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Ryan Blue
> Assignee: Apache Spark
> Priority: Major
>
> DataSourceV2 row-oriented implementations are limited to producing either {{Row}} instances or {{UnsafeRow}} instances by implementing {{SupportsScanUnsafeRow}}. Instead, I think that implementations should always produce {{InternalRow}}.
>
> The problem with the choice between {{Row}} and {{UnsafeRow}} is that neither one is appropriate for implementers.
>
> File formats don't produce {{Row}} instances or the data values used by {{Row}}, like {{java.sql.Timestamp}} and {{java.sql.Date}}. An implementation that uses {{Row}} instances must produce data that is immediately translated from the representation that was just produced by Spark. In my experience, it made little sense to translate a timestamp in microseconds to a (milliseconds, nanoseconds) pair, create a {{Timestamp}} instance, and pass that instance to Spark for immediate translation back.
>
> On the other hand, {{UnsafeRow}} is very difficult to produce unless data is already held in memory. Even the Parquet support built into Spark deserializes to {{InternalRow}} and then uses {{UnsafeProjection}} to produce unsafe rows. When I went to build an implementation that deserializes Parquet or Avro directly to {{UnsafeRow}} (I tried both), I found that it couldn't be done without first deserializing into memory, because the size of an array must be known before any values are written.
>
> I ended up deciding to deserialize to {{InternalRow}} and use {{UnsafeProjection}} to convert to unsafe. There are two problems with this: first, this is Scala and was difficult to call from Java (it required reflection), and second, this causes double projection in the physical plan (a copy from unsafe to unsafe) if there is a projection that wasn't fully pushed to the data source.
>
> I think the solution is to have a single interface for readers that expects {{InternalRow}}. Then, a projection should be added in the Spark plan to convert to unsafe, avoiding projection both in the plan and in the data source. If the data source already produces unsafe rows by deserializing directly, this still minimizes the number of copies, because the unsafe projection will check whether the incoming data is already {{UnsafeRow}}.
>
> Using {{InternalRow}} would also match the interface on the write side.
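The copy-avoidance argument in the last paragraphs can be sketched abstractly. The following is a hypothetical, non-Spark Python sketch of a projection that converts only rows that are not already in the target representation; the class and function names are invented for illustration and are not Spark's API:

```python
class UnsafeRow:
    """Stand-in for a binary row format (hypothetical)."""
    def __init__(self, values):
        self.values = list(values)

def unsafe_projection(row):
    """Convert a row to UnsafeRow, skipping the copy when it already is one."""
    # If the incoming row is already UnsafeRow, return it as-is: no copy.
    if isinstance(row, UnsafeRow):
        return row
    # Otherwise convert; this is the single copy the plan pays for.
    return UnsafeRow(row)

internal = [1, "a"]                  # a generic InternalRow-like row
already_unsafe = UnsafeRow([2, "b"])

projected = unsafe_projection(already_unsafe)  # same object back, zero copies
converted = unsafe_projection(internal)        # one conversion copy
```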
[jira] [Commented] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.
[ https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446313#comment-16446313 ]

Apache Spark commented on SPARK-23325:
--------------------------------------
User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/21118

> DataSourceV2 readers should always produce InternalRow.
[jira] [Assigned] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.
[ https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23325:
------------------------------------
    Assignee: (was: Apache Spark)

> DataSourceV2 readers should always produce InternalRow.
[jira] [Updated] (SPARK-24019) AnalysisException for Window function expression to compute derivative
[ https://issues.apache.org/jira/browse/SPARK-24019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Jin updated SPARK-24019:
---------------------------
    Component/s: (was: Spark Core)
                 SQL

> AnalysisException for Window function expression to compute derivative
>
> Key: SPARK-24019
> URL: https://issues.apache.org/jira/browse/SPARK-24019
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.1
> Environment: Ubuntu, spark 2.1.1, standalone.
> Reporter: Barry Becker
> Priority: Minor
>
> I am using spark 2.1.1 currently. I created an expression to compute the derivative of some series data using a window function. I have a simple reproducible case of the error. I'm only filing this bug because the error message says "Please file a bug report with this error message, stack trace, and the query." Here they are:
> {code:java}
> ((coalesce(lead(value#9, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), value#9) - coalesce(lag(value#9, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), value#9)) / cast((coalesce(lead(sequence_num#7, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), sequence_num#7) - coalesce(lag(sequence_num#7, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), sequence_num#7)) as double)) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS derivative#14 has multiple Window Specifications (ArrayBuffer(windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING))).
> Please file a bug report with this error message, stack trace, and the query.;
> org.apache.spark.sql.AnalysisException: ((coalesce(lead(value#9, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), value#9) - coalesce(lag(value#9, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), value#9)) / cast((coalesce(lead(sequence_num#7, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), sequence_num#7) - coalesce(lag(sequence_num#7, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), sequence_num#7)) as double)) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS derivative#14 has multiple Window Specifications (ArrayBuffer(windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING))).
> Please file a bug report with this error message, stack trace, and the query.;
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$78.apply(Analyzer.scala:1772)
> {code}
> And here is a simple unit test that can be used to reproduce the problem:
> {code:java}
> import com.mineset.spark.testsupport.SparkTestCase.SPARK_SESSION
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions._
> import org.scalatest.FunSuite
> import com.mineset.spark.testsupport.SparkTestCase._
>
> /**
>  * Test to see that window functions work as expected on spark.
>  * @author Barry Becker
>  */
> class WindowFunctionSuite extends FunSuite {
>
>   val simpleDf = createSimpleData()
>
>   test("Window function for finding derivatives for 2 series") {
>     val window = Window.partitionBy("category").orderBy("sequence_num") //.rangeBetween(-1, 1)
>     // Consider the three sequential series points (Xlag, Ylag), (X, Y), (Xlead, Ylead).
>     // This defines the derivative as (Ylead - Ylag) / (Xlead - Xlag)
>     // If the lead or lag points are null, then we fall back on using the
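Independent of the window-function machinery, the derivative defined in the test's comments can be sketched in plain Python. The helper below is hypothetical, not part of the reported code; it implements the central difference (Ylead - Ylag) / (Xlead - Xlag), falling back to the current point where a lead or lag neighbor is missing:

```python
def derivative(points):
    """Central-difference derivative over (x, y) points.

    For each point, computes (y_lead - y_lag) / (x_lead - x_lag); at the
    edges of the series, where the lead or lag point is missing, falls
    back to using the current point in its place.
    """
    result = []
    for i, (x, y) in enumerate(points):
        x_lag, y_lag = points[i - 1] if i > 0 else (x, y)
        x_lead, y_lead = points[i + 1] if i + 1 < len(points) else (x, y)
        result.append((y_lead - y_lag) / (x_lead - x_lag))
    return result
```

For the line y = 2x sampled at x = 1, 2, 3 the derivative is 2 everywhere, including at the edges thanks to the fallback.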
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446247#comment-16446247 ]

Edwina Lu commented on SPARK-23206:
-----------------------------------
The design discussion is scheduled for Monday 4/23 at 11am PDT (Monday April 23, 6pm UTC). Bluejeans: https://linkedin.bluejeans.com/1886759322/

> Additional Memory Tuning Metrics
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
> Issue Type: Umbrella
> Components: Spark Core
> Affects Versions: 2.2.1
> Reporter: Edwina Lu
> Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png
>
> At LinkedIn, we have multiple clusters, running thousands of Spark applications, and these numbers are growing rapidly. We need to ensure that these Spark applications are well tuned -- cluster resources, including memory, should be used efficiently so that the cluster can support running more applications concurrently, and applications should run quickly and reliably.
>
> Currently there is limited visibility into how much memory executors are using, and users are guessing numbers for executor and driver memory sizing. These estimates are often much larger than needed, leading to memory wastage. Examining the metrics for one cluster for a month, the average percentage of used executor memory (max JVM used memory across executors / spark.executor.memory) is 35%, leading to an average of 591GB unused memory per application (number of executors * (spark.executor.memory - max JVM used memory)). Spark has multiple memory regions (user memory, execution memory, storage memory, and overhead memory), and to understand how memory is being used and fine-tune allocation between regions, it would be useful to have information about how much memory is being used for the different regions.
>
> To improve visibility into memory usage for the driver and executors and different memory regions, the following additional memory metrics can be tracked for each executor and driver:
> * JVM used memory: the JVM heap size for the executor/driver.
> * Execution memory: memory used for computation in shuffles, joins, sorts and aggregations.
> * Storage memory: memory used for caching and propagating internal data across the cluster.
> * Unified memory: sum of execution and storage memory.
>
> The peak values for each memory metric can be tracked for each executor, and also per stage. This information can be shown in the Spark UI and the REST APIs. Information for peak JVM used memory can help with determining appropriate values for spark.executor.memory and spark.driver.memory, and information about the unified memory region can help with determining appropriate values for spark.memory.fraction and spark.memory.storageFraction. Stage memory information can help identify which stages are most memory intensive, and users can look into the relevant code to determine if it can be optimized.
>
> The memory metrics can be gathered by adding the current JVM used memory, execution memory and storage memory to the heartbeat. SparkListeners are modified to collect the new metrics for the executors, stages and Spark history log. Only interesting values (peak values per stage per executor) are recorded in the Spark history log, to minimize the amount of additional logging.
>
> We have attached our design documentation with this ticket and would like to receive feedback from the community for this proposal.
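The unused-memory estimate quoted in the description can be reproduced arithmetically. The sketch below encodes the ticket's formula (executors * (spark.executor.memory - max JVM used memory)); the executor count and memory size here are hypothetical values chosen so the result lands near the reported ~591 GB average at 35% utilization:

```python
def unused_memory_gb(num_executors, executor_memory_gb, used_fraction):
    """Unused memory per the ticket's formula:
    executors * (spark.executor.memory - max JVM used memory)."""
    max_jvm_used = executor_memory_gb * used_fraction
    return num_executors * (executor_memory_gb - max_jvm_used)

# Hypothetical application: 100 executors of 9.1 GB each, at the
# ticket's reported average 35% utilization.
wasted = unused_memory_gb(100, 9.1, 0.35)  # ~591.5 GB left idle
```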
[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)
[ https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446176#comment-16446176 ]

Apache Spark commented on SPARK-10399:
--------------------------------------
User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/21117

> Off Heap Memory Access for non-JVM libraries (C++)
>
> Key: SPARK-10399
> URL: https://issues.apache.org/jira/browse/SPARK-10399
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Paul Weiss
> Priority: Major
>
> *Summary*
> Provide direct off-heap memory access to an external non-JVM program, such as a C++ library, within the running Spark JVM/executor. As Spark moves toward storing all data in off-heap memory, it makes sense to provide access points to that memory for non-JVM programs.
>
> *Assumptions*
> * Zero copies will be made during the call into the non-JVM library
> * Access into non-JVM libraries will be accomplished via JNI
> * A generic JNI interface will be created so that developers will not need to deal with the raw JNI call
> * C++ will be the initial target non-JVM use case
> * Memory management will remain on the JVM/Spark side
> * The API from C++ will be similar to DataFrames as much as feasible and NOT require expert knowledge of JNI
> * Data organization and layout will support complex (multi-type, nested, etc.) types
>
> *Design*
> * Initially Spark JVM -> non-JVM will be supported
> * Creating an embedded JVM with Spark running from a non-JVM program is initially out of scope
>
> *Technical*
> * GetDirectBufferAddress is the JNI call used to access a byte buffer without a copy
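The proposal relies on JNI's GetDirectBufferAddress to hand a native library the address of a direct buffer. As a language-neutral illustration of the underlying zero-copy idea (not the actual JNI mechanism), Python's buffer protocol shows two views sharing one memory region without copying:

```python
# Illustrative only: a memoryview exposes the same underlying bytes as the
# bytearray it wraps, without copying them -- analogous to how
# GetDirectBufferAddress exposes a direct ByteBuffer's memory to native code.

buf = bytearray(b"spark off-heap demo")
view = memoryview(buf)   # no copy: shares buf's memory

# Mutating through the view is visible in the original buffer,
# demonstrating that no copy was made.
view[0:5] = b"SPARK"
```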
[jira] [Commented] (SPARK-23879) Introduce MemoryBlock API instead of Platform API with Object
[ https://issues.apache.org/jira/browse/SPARK-23879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446177#comment-16446177 ]

Apache Spark commented on SPARK-23879:
--------------------------------------
User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/21117

> Introduce MemoryBlock API instead of Platform API with Object
>
> Key: SPARK-23879
> URL: https://issues.apache.org/jira/browse/SPARK-23879
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, SQL
> Affects Versions: 2.3.0
> Reporter: Kazuaki Ishizaki
> Priority: Major
>
> This JIRA is derived from SPARK-10399. During the discussion, the community revealed that the current Spark framework directly accesses several types of memory regions (e.g. {{byte[]}}, {{long[]}}, or off-heap) by using the {{Platform}} API with the {{Object}} type. It would be good to have a unified memory-management API with a clear memory model.
> It is also good for performance. If we can pass a typed object (e.g. {{byte[]}}, {{long[]}}) to {{Platform.getXX()/putXX()}}, it is faster than using {{Platform.getXX(Object)/putXX(Object, ...)}}.
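As a rough, non-JVM sketch of what a typed memory-block API buys over an untyped (Object, offset) interface, the following hypothetical Python class wraps one backing region behind typed accessors; the names are invented for illustration and do not reflect Spark's actual MemoryBlock API:

```python
import struct

class ByteArrayMemoryBlock:
    """Hypothetical on-heap memory region backed by a bytearray.

    Callers use typed get/put methods on the block itself, instead of
    passing an untyped object plus an offset to a global Platform-style
    helper -- the shape of the API change SPARK-23879 proposes.
    """
    def __init__(self, size):
        self._data = bytearray(size)

    def put_long(self, offset, value):
        # Little-endian 8-byte signed integer at the given offset.
        struct.pack_into("<q", self._data, offset, value)

    def get_long(self, offset):
        return struct.unpack_from("<q", self._data, offset)[0]

block = ByteArrayMemoryBlock(16)
block.put_long(8, 42)
```

An off-heap variant could present the same accessors over a different backing region, which is the point of unifying the API.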
[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)
[ https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446166#comment-16446166 ]

Kazuaki Ishizaki commented on SPARK-10399:
------------------------------------------
https://issues.apache.org/jira/browse/SPARK-23879 is the follow-up JIRA entry.

> Off Heap Memory Access for non-JVM libraries (C++)
[jira] [Commented] (SPARK-24038) refactor continuous write exec to its own class
[ https://issues.apache.org/jira/browse/SPARK-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446161#comment-16446161 ]

Apache Spark commented on SPARK-24038:
--------------------------------------
User 'jose-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/21116

> refactor continuous write exec to its own class
>
> Key: SPARK-24038
> URL: https://issues.apache.org/jira/browse/SPARK-24038
> Project: Spark
> Issue Type: Sub-task
> Components: Structured Streaming
> Affects Versions: 2.4.0
> Reporter: Jose Torres
> Priority: Major
[jira] [Assigned] (SPARK-24038) refactor continuous write exec to its own class
[ https://issues.apache.org/jira/browse/SPARK-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24038:
------------------------------------
    Assignee: (was: Apache Spark)

> refactor continuous write exec to its own class
[jira] [Assigned] (SPARK-24038) refactor continuous write exec to its own class
[ https://issues.apache.org/jira/browse/SPARK-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24038:
------------------------------------
    Assignee: Apache Spark

> refactor continuous write exec to its own class
[jira] [Created] (SPARK-24041) add flag to remove whitelist of continuous processing operators
Jose Torres created SPARK-24041:
-----------------------------------

             Summary: add flag to remove whitelist of continuous processing operators
                 Key: SPARK-24041
                 URL: https://issues.apache.org/jira/browse/SPARK-24041
             Project: Spark
          Issue Type: Sub-task
          Components: Structured Streaming
    Affects Versions: 2.4.0
            Reporter: Jose Torres

Initially this will just be for unit testing while support is being developed, but in the long term continuous processing should probably support most query nodes.
[jira] [Created] (SPARK-24039) remove restarting iterators hack
Jose Torres created SPARK-24039:
-----------------------------------

             Summary: remove restarting iterators hack
                 Key: SPARK-24039
                 URL: https://issues.apache.org/jira/browse/SPARK-24039
             Project: Spark
          Issue Type: Sub-task
          Components: Structured Streaming
    Affects Versions: 2.4.0
            Reporter: Jose Torres

Currently, continuous processing execution calls next() to restart the query iterator after it returns false. This doesn't work for complex RDDs - we need to call compute() instead. This isn't refactoring-only; changes will be required to keep the reader from starting over in each compute() call.
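The restart problem is a general iterator property, not something Spark-specific. A minimal Python sketch (with make_iterator standing in, loosely, for an RDD's compute()) shows why an exhausted iterator can't be "nexted" back to life, and why restarting means rebuilding it:

```python
def make_iterator():
    # Stand-in for compute(): builds a fresh iterator over the data.
    return iter([1, 2, 3])

it = make_iterator()
first_pass = list(it)    # exhausts the iterator

# Once exhausted, the same iterator yields nothing more; calling next()
# on it again cannot restart it.
leftover = list(it)      # empty

# Restarting requires recomputing, i.e. building a new iterator.
second_pass = list(make_iterator())
```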
[jira] [Created] (SPARK-24040) support single partition aggregates
Jose Torres created SPARK-24040:
-----------------------------------

             Summary: support single partition aggregates
                 Key: SPARK-24040
                 URL: https://issues.apache.org/jira/browse/SPARK-24040
             Project: Spark
          Issue Type: Sub-task
          Components: Structured Streaming
    Affects Versions: 2.4.0
            Reporter: Jose Torres

Single partition aggregates are a useful milestone because they don't involve a shuffle.
[jira] [Created] (SPARK-24038) refactor continuous write exec to its own class
Jose Torres created SPARK-24038:
-----------------------------------

             Summary: refactor continuous write exec to its own class
                 Key: SPARK-24038
                 URL: https://issues.apache.org/jira/browse/SPARK-24038
             Project: Spark
          Issue Type: Sub-task
          Components: Structured Streaming
    Affects Versions: 2.4.0
            Reporter: Jose Torres
[jira] [Created] (SPARK-24037) stateful operators
Jose Torres created SPARK-24037:
-----------------------------------

             Summary: stateful operators
                 Key: SPARK-24037
                 URL: https://issues.apache.org/jira/browse/SPARK-24037
             Project: Spark
          Issue Type: Sub-task
          Components: Structured Streaming
    Affects Versions: 2.4.0
            Reporter: Jose Torres

pointer to https://issues.apache.org/jira/browse/SPARK-24036
[jira] [Created] (SPARK-24036) Stateful operators in continuous processing
Jose Torres created SPARK-24036:
-----------------------------------

             Summary: Stateful operators in continuous processing
                 Key: SPARK-24036
                 URL: https://issues.apache.org/jira/browse/SPARK-24036
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 2.4.0
            Reporter: Jose Torres

The first iteration of continuous processing in Spark 2.3 does not work with stateful operators.
[jira] [Assigned] (SPARK-24033) LAG Window function broken in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-24033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24033: Assignee: Xiao Li (was: Apache Spark) > LAG Window function broken in Spark 2.3 > --- > > Key: SPARK-24033 > URL: https://issues.apache.org/jira/browse/SPARK-24033 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Emlyn Corrin >Assignee: Xiao Li >Priority: Major > > The {{LAG}} window function appears to be broken in Spark 2.3.0, always > failing with an AnalysisException. Interestingly, {{LEAD}} is not affected, > so it can be worked around by negating the lag and using lead instead. > Reproduction (run in {{spark-shell}}): > {code:java} > import org.apache.spark.sql.expressions.Window > val ds = Seq((1,1),(1,2),(1,3),(2,1),(2,2)).toDF("n", "i") > // The following works: > ds.withColumn("m", lead("i", > -1).over(Window.partitionBy("n").orderBy("i").rowsBetween(-1, -1))).show > // The following (equivalent) fails: > ds.withColumn("m", lag("i", > 1).over(Window.partitionBy("n").orderBy("i").rowsBetween(-1, -1))).show > {code} > Here is the stacktrace: > {quote} > org.apache.spark.sql.AnalysisException: Window Frame > specifiedwindowframe(RowFrame, -1, -1) must match the required frame > specifiedwindowframe(RowFrame, -1, -1); > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:2034) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:2030) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:85) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:85) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:120) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:125) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:125) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:85) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:76) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31.applyOrElse(Analyzer.scala:2030) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31.applyOrElse(Analyzer.scala:2029) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at >
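[Editorial note: the workaround in the report relies on the identity lag(col, n) = lead(col, -n) over an ordered partition. The following plain-Python sketch (not Spark's implementation) demonstrates that identity; the `lag`/`lead` helpers here are illustrative, not the Spark functions.]

```python
# Over an ordered partition, lag(values, n) and lead(values, -n)
# select the same neighboring row for every position.
def lag(values, offset, default=None):
    return [values[i - offset] if 0 <= i - offset < len(values) else default
            for i in range(len(values))]

def lead(values, offset, default=None):
    return [values[i + offset] if 0 <= i + offset < len(values) else default
            for i in range(len(values))]

partition = [1, 2, 3]  # column "i" within partition n = 1, ordered by "i"
assert lag(partition, 1) == lead(partition, -1) == [None, 1, 2]
```

This is why replacing the failing `lag("i", 1)` with `lead("i", -1)` in the reproduction produces the intended result.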
[jira] [Commented] (SPARK-24033) LAG Window function broken in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-24033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446090#comment-16446090 ] Apache Spark commented on SPARK-24033: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/21115
[jira] [Assigned] (SPARK-24033) LAG Window function broken in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-24033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24033: Assignee: Apache Spark (was: Xiao Li)
[jira] [Assigned] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22371: Assignee: (was: Apache Spark) > dag-scheduler-event-loop thread stopped with error Attempted to access > garbage collected accumulator 5605982 > - > > Key: SPARK-22371 > URL: https://issues.apache.org/jira/browse/SPARK-22371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Mayank Agarwal >Priority: Major > Attachments: Helper.scala, ShuffleIssue.java, > driver-thread-dump-spark2.1.txt, sampledata > > > Our Spark jobs are getting stuck in DagScheduler.runJob because the dag-scheduler > thread is stopped with *Attempted to access garbage collected > accumulator 5605982*. > From our investigation it looks like the accumulator is cleaned up by GC first, and > the same accumulator is then used for merging the results from an executor on a task > completion event. > Because java.lang.IllegalAccessError is a LinkageError, which is treated as > a fatal error, the dag-scheduler loop finishes with the exception below. 
> ---ERROR stack trace -- > Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 5605982 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > I am attaching the thread dump of driver as well -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446084#comment-16446084 ] Apache Spark commented on SPARK-22371: -- User 'artemrd' has created a pull request for this issue: https://github.com/apache/spark/pull/21114 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22371: Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-23775) Flaky test: DataFrameRangeSuite
[ https://issues.apache.org/jira/browse/SPARK-23775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reopened SPARK-23775: Assignee: (was: Gabor Somogyi) > Flaky test: DataFrameRangeSuite > --- > > Key: SPARK-23775 > URL: https://issues.apache.org/jira/browse/SPARK-23775 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Gabor Somogyi >Priority: Major > Fix For: 2.3.1, 2.4.0 > > Attachments: filtered.log, filtered_more_logs.log > > > DataFrameRangeSuite.test("Cancelling stage in a query with Range.") sometimes > stays in an infinite loop and times out the build. > I presume the original intention of this test is to start a job with range > and just cancel it. > The submitted job has 2 stages, but I think the author tried to cancel the > first stage, whose ID is 0, which this check does not match: > {code:java} > eventually(timeout(10.seconds), interval(1.millis)) { > assert(DataFrameRangeSuite.stageToKill > 0) > } > {code} > All in all, if the first stage is slower than 10 seconds, this throws > TestFailedDueToTimeoutException and cancelStage will never be called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
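[Editorial note: the failure mode described above can be sketched in plain Python. The `eventually` helper below is a hypothetical stand-in for ScalaTest's eventually(timeout, interval), not the real library.]

```python
import time

# Poll a condition, raising if it never holds within the timeout --
# the same shape as the test's eventually(timeout(10.seconds), ...).
def eventually(condition, timeout_s=0.1, interval_s=0.005):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(interval_s)
    raise TimeoutError("condition not met within timeout")

# The race from the report: the watched stage ID stays 0, but the
# condition checks `> 0`, so the poll times out and the cancelStage
# call that would follow is never reached.
stage_to_kill = 0
cancelled = False
try:
    eventually(lambda: stage_to_kill > 0)
    cancelled = True  # the test would call cancelStage here
except TimeoutError:
    pass
assert not cancelled
```

If the condition can never become true for the stage actually submitted, the timeout path is the only possible outcome, which matches the observed flakiness.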
[jira] [Commented] (SPARK-24035) SQL syntax for Pivot
[ https://issues.apache.org/jira/browse/SPARK-24035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446058#comment-16446058 ] Xiao Li commented on SPARK-24035: - Also cc [~simeons] > SQL syntax for Pivot > > > Key: SPARK-24035 > URL: https://issues.apache.org/jira/browse/SPARK-24035 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Maryann Xue >Priority: Major > > Some users are SQL experts but don’t know an ounce of Scala/Python or R. > Thus, we prefer to support the SQL syntax for Pivot too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23775) Flaky test: DataFrameRangeSuite
[ https://issues.apache.org/jira/browse/SPARK-23775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23775: Assignee: (was: Apache Spark) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23775) Flaky test: DataFrameRangeSuite
[ https://issues.apache.org/jira/browse/SPARK-23775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23775: Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23775) Flaky test: DataFrameRangeSuite
[ https://issues.apache.org/jira/browse/SPARK-23775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-23775: --- Fix Version/s: (was: 2.3.1) (was: 2.4.0) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24033) LAG Window function broken in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-24033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-24033: --- Assignee: Xiao Li
[jira] [Created] (SPARK-24035) SQL syntax for Pivot
Xiao Li created SPARK-24035: --- Summary: SQL syntax for Pivot Key: SPARK-24035 URL: https://issues.apache.org/jira/browse/SPARK-24035 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Xiao Li Assignee: Maryann Xue Some users are SQL experts but don’t know an ounce of Scala, Python, or R. Thus, we prefer to support the SQL syntax for Pivot too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445999#comment-16445999 ] Artem Rudoy edited comment on SPARK-22371 at 4/20/18 4:37 PM: -- Do we really need to throw an exception from AccumulatorContext.get() when an accumulator is garbage collected? There's a period of time when an accumulator has been garbage collected, but hasn't been removed from AccumulatorContext.originals by ContextCleaner. When an update is received for such accumulator it will throw an exception and kill the whole job. This can happen when a stage completed, but there're still running tasks from other attempts, speculation etc. Since AccumulatorContext.get() returns an Option we could just return None in such case. Before SPARK-20940 this method threw IllegalAccessError which is not a NonFatal, was caught at a lower level and didn't cause job failure. was (Author: rudoy): Do we really need to throw an exception from AccumulatorContext.get() when an accumulator is garbage collected? There's a period of time when an accumulator has been garbage collected, but hasn't been removed from AccumulatorContext.originals by ContextCleaner. When an update is received for such accumulator it will throw an exception and kill the whole job. This can happen when a stage completed, but there're still running tasks from other attempts, speculation etc. Since AccumulatorContext.get() returns an Option we could just return None in such case. Before SPARK-20940 this method thrown IllegalAccessError which is not a NonFatal, was caught at a lower level and didn't cause job failure. 
> dag-scheduler-event-loop thread stopped with error Attempted to access > garbage collected accumulator 5605982 > - > > Key: SPARK-22371 > URL: https://issues.apache.org/jira/browse/SPARK-22371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Mayank Agarwal >Priority: Major > Attachments: Helper.scala, ShuffleIssue.java, > driver-thread-dump-spark2.1.txt, sampledata > > > Our Spark jobs are getting stuck on DagScheduler.runJob as the dagscheduler > thread is stopped because of *Attempted to access garbage collected > accumulator 5605982*. > From our investigation it looks like the accumulator is cleaned by GC first and > the same accumulator is then used for merging the results from the executor on the task > completion event. > As the error java.lang.IllegalAccessError is a LinkageError, which is treated as > a FatalError, the dag-scheduler loop terminates with the below exception. > ---ERROR stack trace -- > Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 5605982 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > I am attaching the thread dump of driver as well -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445999#comment-16445999 ] Artem Rudoy commented on SPARK-22371: - Do we really need to throw an exception from AccumulatorContext.get() when an accumulator is garbage collected? There's a period of time when an accumulator has been garbage collected, but hasn't been removed from AccumulatorContext.originals by ContextCleaner. When an update is received for such accumulator it will throw an exception and kill the whole job. This can happen when a stage completed, but there're still running tasks from other attempts, speculation etc. Since AccumulatorContext.get() returns an Option we could just return None in such case. Before SPARK-20940 this method threw IllegalAccessError which is not a NonFatal, was caught at a lower level and didn't cause job failure. > dag-scheduler-event-loop thread stopped with error Attempted to access > garbage collected accumulator 5605982 > - > > Key: SPARK-22371 > URL: https://issues.apache.org/jira/browse/SPARK-22371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Mayank Agarwal >Priority: Major > Attachments: Helper.scala, ShuffleIssue.java, > driver-thread-dump-spark2.1.txt, sampledata > > > Our Spark jobs are getting stuck on DagScheduler.runJob as the dagscheduler > thread is stopped because of *Attempted to access garbage collected > accumulator 5605982*. > From our investigation it looks like the accumulator is cleaned by GC first and > the same accumulator is then used for merging the results from the executor on the task > completion event. > As the error java.lang.IllegalAccessError is a LinkageError, which is treated as > a FatalError, the dag-scheduler loop terminates with the below exception. 
> ---ERROR stack trace -- > Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 5605982 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > I am attaching the thread dump of driver as well -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
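The behaviour proposed in the comment above can be sketched without Spark. The following is a hypothetical, minimal Python stand-in (the names `AccumulatorRegistry` and `Accumulator` are invented for illustration and do not match Spark's internals): a registry that holds weak references and reports a collected accumulator as absent (`None`) rather than raising a fatal error, so a late update from a speculative or re-attempted task cannot kill the job.

```python
import weakref


class AccumulatorRegistry:
    """Toy stand-in for AccumulatorContext: it holds weak references,
    so registered accumulators can still be garbage collected."""

    def __init__(self):
        self._originals = {}

    def register(self, acc_id, acc):
        self._originals[acc_id] = weakref.ref(acc)

    def get(self, acc_id):
        ref = self._originals.get(acc_id)
        if ref is None:
            return None  # never registered
        # Proposed behaviour: a collected accumulator is simply absent,
        # instead of an IllegalAccessError that kills the scheduler loop.
        return ref()  # None once the accumulator is garbage collected


class Accumulator:
    def __init__(self, value=0):
        self.value = value


registry = AccumulatorRegistry()
acc = Accumulator()
registry.register(5605982, acc)
print(registry.get(5605982) is acc)  # True while the accumulator is alive
del acc  # simulate the accumulator being garbage collected
print(registry.get(5605982))         # None (CPython frees it immediately)
```

A caller that receives `None` can then drop the late update silently, which mirrors the pre-SPARK-20940 behaviour where the error was caught at a lower level and did not fail the job.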
[jira] [Updated] (SPARK-24034) StopIteration in pyspark mapper results in partial results
[ https://issues.apache.org/jira/browse/SPARK-24034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emilio Dorigatti updated SPARK-24034: - Description: Consider the following code {noformat} def mapper(xx): if xx % 2 == 0: raise StopIteration() else: return xx sc.parallelize(range(100)).map(mapper).collect() {noformat} The result I get is `[57, 71, 85]` I think it happens because {{map}} is implemented in terms of {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} raised by the mapper is handled by that iterator. I think this should be raised to the user instead. NB: this may be the underlying cause of https://issues.apache.org/jira/browse/SPARK-23754 was: Consider the following code def mapper(xx): if xx % 2 == 0: raise StopIteration() else: return xx sc.parallelize(range(100)).map(mapper)collect() The result I get is `[57, 71, 85]` I think it happens because `map` is implemented in terms of `mapPartitionsWithIndex` using a custom iterator, so the StopIteration raised by the mapper is handled by that iterator. I think this should be raised to the user instead NB: this may be the underlying cause of https://issues.apache.org/jira/browse/SPARK-23754 > StopIteration in pyspark mapper results in partial results > -- > > Key: SPARK-24034 > URL: https://issues.apache.org/jira/browse/SPARK-24034 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Emilio Dorigatti >Priority: Major > > Consider the following code > {noformat} > def mapper(xx): > if xx % 2 == 0: > raise StopIteration() > else: > return xx > sc.parallelize(range(100)).map(mapper).collect() > {noformat} > The result I get is `[57, 71, 85]` > I think it happens because {{map}} is implemented in terms of > {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} > raised by the mapper is handled by that iterator. 
I think this should be > raised to the user instead > NB: this may be the underlying cause of > https://issues.apache.org/jira/browse/SPARK-23754 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24034) StopIteration in pyspark mapper results in partial results
Emilio Dorigatti created SPARK-24034: Summary: StopIteration in pyspark mapper results in partial results Key: SPARK-24034 URL: https://issues.apache.org/jira/browse/SPARK-24034 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.0 Reporter: Emilio Dorigatti Consider the following code def mapper(xx): if xx % 2 == 0: raise StopIteration() else: return xx sc.parallelize(range(100)).map(mapper).collect() The result I get is `[57, 71, 85]` I think it happens because `map` is implemented in terms of `mapPartitionsWithIndex` using a custom iterator, so the StopIteration raised by the mapper is handled by that iterator. I think this should be raised to the user instead. NB: this may be the underlying cause of https://issues.apache.org/jira/browse/SPARK-23754 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24034) StopIteration in pyspark mapper gives partial results
[ https://issues.apache.org/jira/browse/SPARK-24034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emilio Dorigatti updated SPARK-24034: - Description: Consider the following code {noformat} def mapper(xx): if xx % 2 == 0: raise StopIteration() else: return xx sc.parallelize(range(100)).map(mapper).collect() {noformat} The result I get is {{[57, 71, 85]}} I think it happens because {{map}} is implemented in terms of {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} raised by the mapper is handled by that iterator. I think this should be raised to the user instead. I think I can take care of this, if I am allowed to (first time I contribute, not sure how it works) NB: this may be the underlying cause of https://issues.apache.org/jira/browse/SPARK-23754 was: Consider the following code {noformat} def mapper(xx): if xx % 2 == 0: raise StopIteration() else: return xx sc.parallelize(range(100)).map(mapper).collect() {noformat} The result I get is {{[57, 71, 85]}} I think it happens because {{map }}is implemented in terms of {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} raised by the mapper is handled by that iterator. I think this should be raised to the user instead. 
I think I can take care of this, if I am allowed to (first time I contribute, not sure how it works) NB: this may be the underlying cause of https://issues.apache.org/jira/browse/SPARK-23754 > StopIteration in pyspark mapper gives partial results > - > > Key: SPARK-24034 > URL: https://issues.apache.org/jira/browse/SPARK-24034 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Emilio Dorigatti >Priority: Major > > Consider the following code > {noformat} > def mapper(xx): > if xx % 2 == 0: > raise StopIteration() > else: > return xx > sc.parallelize(range(100)).map(mapper).collect() > {noformat} > The result I get is {{[57, 71, 85]}} > I think it happens because {{map}} is implemented in terms of > {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} > raised by the mapper is handled by that iterator. I think this should be > raised to the user instead. > I think I can take care of this, if I am allowed to (first time I contribute, > not sure how it works) > NB: this may be the underlying cause of > https://issues.apache.org/jira/browse/SPARK-23754 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24034) StopIteration in pyspark mapper results in partial results
[ https://issues.apache.org/jira/browse/SPARK-24034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emilio Dorigatti updated SPARK-24034: - Description: Consider the following code {noformat} def mapper(xx): if xx % 2 == 0: raise StopIteration() else: return xx sc.parallelize(range(100)).map(mapper).collect() {noformat} The result I get is {{[57, 71, 85]}} I think it happens because {{map}} is implemented in terms of {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} raised by the mapper is handled by that iterator. I think this should be raised to the user instead. I think I can take care of this, if I am allowed to (first time I contribute, not sure how it works) NB: this may be the underlying cause of https://issues.apache.org/jira/browse/SPARK-23754 was: Consider the following code {noformat} def mapper(xx): if xx % 2 == 0: raise StopIteration() else: return xx sc.parallelize(range(100)).map(mapper)collect() {noformat} The result I get is `[57, 71, 85]` I think it happens because {{map }}is implemented in terms of {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} raised by the mapper is handled by that iterator. 
I think this should be raised to the user instead NB: this may be the underlying cause of https://issues.apache.org/jira/browse/SPARK-23754 > StopIteration in pyspark mapper results in partial results > -- > > Key: SPARK-24034 > URL: https://issues.apache.org/jira/browse/SPARK-24034 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Emilio Dorigatti >Priority: Major > > Consider the following code > {noformat} > def mapper(xx): > if xx % 2 == 0: > raise StopIteration() > else: > return xx > sc.parallelize(range(100)).map(mapper).collect() > {noformat} > The result I get is {{[57, 71, 85]}} > I think it happens because {{map}} is implemented in terms of > {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} > raised by the mapper is handled by that iterator. I think this should be > raised to the user instead. > I think I can take care of this, if I am allowed to (first time I contribute, > not sure how it works) > NB: this may be the underlying cause of > https://issues.apache.org/jira/browse/SPARK-23754 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24034) StopIteration in pyspark mapper results in partial results
[ https://issues.apache.org/jira/browse/SPARK-24034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emilio Dorigatti updated SPARK-24034: - Description: Consider the following code {noformat} def mapper(xx): if xx % 2 == 0: raise StopIteration() else: return xx sc.parallelize(range(100)).map(mapper).collect() {noformat} The result I get is {{[57, 71, 85]}} I think it happens because {{map}} is implemented in terms of {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} raised by the mapper is handled by that iterator. I think this should be raised to the user instead. I think I can take care of this, if I am allowed to (first time I contribute, not sure how it works) NB: this may be the underlying cause of https://issues.apache.org/jira/browse/SPARK-23754 was: Consider the following code {noformat} def mapper(xx): if xx % 2 == 0: raise StopIteration() else: return xx sc.parallelize(range(100)).map(mapper)collect() {noformat} The result I get is {{[57, 71, 85]}} I think it happens because {{map }}is implemented in terms of {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} raised by the mapper is handled by that iterator. I think this should be raised to the user instead. 
I think I can take care of this, if I am allowed to (first time I contribute, not sure how it works) NB: this may be the underlying cause of https://issues.apache.org/jira/browse/SPARK-23754 > StopIteration in pyspark mapper results in partial results > -- > > Key: SPARK-24034 > URL: https://issues.apache.org/jira/browse/SPARK-24034 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Emilio Dorigatti >Priority: Major > > Consider the following code > {noformat} > def mapper(xx): > if xx % 2 == 0: > raise StopIteration() > else: > return xx > sc.parallelize(range(100)).map(mapper).collect() > {noformat} > The result I get is {{[57, 71, 85]}} > I think it happens because {{map}} is implemented in terms of > {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} > raised by the mapper is handled by that iterator. I think this should be > raised to the user instead. > I think I can take care of this, if I am allowed to (first time I contribute, > not sure how it works) > NB: this may be the underlying cause of > https://issues.apache.org/jira/browse/SPARK-23754 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24034) StopIteration in pyspark mapper gives partial results
[ https://issues.apache.org/jira/browse/SPARK-24034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emilio Dorigatti updated SPARK-24034: - Summary: StopIteration in pyspark mapper gives partial results (was: StopIteration in pyspark mapper results in partial results) > StopIteration in pyspark mapper gives partial results > - > > Key: SPARK-24034 > URL: https://issues.apache.org/jira/browse/SPARK-24034 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Emilio Dorigatti >Priority: Major > > Consider the following code > {noformat} > def mapper(xx): > if xx % 2 == 0: > raise StopIteration() > else: > return xx > sc.parallelize(range(100)).map(mapper).collect() > {noformat} > The result I get is {{[57, 71, 85]}} > I think it happens because {{map}} is implemented in terms of > {{mapPartitionsWithIndex}} using a custom iterator, so the {{StopIteration}} > raised by the mapper is handled by that iterator. I think this should be > raised to the user instead. > I think I can take care of this, if I am allowed to (first time I contribute, > not sure how it works) > NB: this may be the underlying cause of > https://issues.apache.org/jira/browse/SPARK-23754 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
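The truncation reported in SPARK-24034 can be reproduced in plain Python, without Spark, because the same mechanism is at work: anything consumed through the iterator protocol treats a StopIteration escaping from user code as normal exhaustion. A minimal sketch:

```python
def mapper(x):
    # Buggy user code: raises StopIteration instead of a normal exception.
    if x % 2 == 0:
        raise StopIteration()
    return x

# list() drives map() through the iterator protocol, so the stray
# StopIteration looks like ordinary exhaustion and silently truncates.
truncated = list(map(mapper, [1, 3, 5, 4, 7, 9]))
print(truncated)  # [1, 3, 5]

def gen_mapper(xs):
    # The same mapper driven from inside a generator.
    for x in xs:
        yield mapper(x)

# Inside a generator, PEP 479 (the default since Python 3.7) converts the
# stray StopIteration into a RuntimeError, surfacing the bug to the user.
try:
    list(gen_mapper([1, 3, 4]))
except RuntimeError as exc:
    print("caught:", exc)
```

One plausible remedy, analogous to what PEP 479 does for generators, is for the framework to wrap user functions and re-raise any StopIteration as a different exception rather than letting it truncate the partition.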
[jira] [Updated] (SPARK-24030) SparkSQL percentile_approx function is too slow for over 1,060,000 records.
[ https://issues.apache.org/jira/browse/SPARK-24030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seok-Joon,Yun updated SPARK-24030: -- Attachment: screenshot_2018-04-20 23.15.02.png > SparkSQL percentile_approx function is too slow for over 1,060,000 records. > --- > > Key: SPARK-24030 > URL: https://issues.apache.org/jira/browse/SPARK-24030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 > Environment: Zeppelin + Spark 2.2.1 on Amazon EMR and local laptop. >Reporter: Seok-Joon,Yun >Priority: Major > Attachments: screenshot_2018-04-20 23.15.02.png > > > I used the percentile_approx function for over 1,060,000 records. It is too > slow: it takes about 90 mins. Then I tried it on 1,040,000 records; that takes about > 10 secs. > I tested reading the data via JDBC and from Parquet; both take the same amount of time. > I suspect the function is not designed for multiple workers. > I checked Ganglia and the Spark history server; it ran on only one worker. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23595) Add interpreted execution for ValidateExternalType expression
[ https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-23595. --- Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.4.0 > Add interpreted execution for ValidateExternalType expression > - > > Key: SPARK-23595 > URL: https://issues.apache.org/jira/browse/SPARK-23595 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24031) the method of postTaskEnd should write once in handleTaskCompletion
[ https://issues.apache.org/jira/browse/SPARK-24031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-24031: -- Affects Version/s: (was: 3.0.0) 2.3.0 Priority: Minor (was: Major) There's no real detail here. You'd have to state the problem and solution, and open a PR. Otherwise I'd close this. > the method of postTaskEnd should write once in handleTaskCompletion > --- > > Key: SPARK-24031 > URL: https://issues.apache.org/jira/browse/SPARK-24031 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Yu Wang >Priority: Minor > Attachments: 24031_master_1.patch > > > the method of postTaskEnd should write once in handleTaskCompletion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24033) LAG Window function broken in Spark 2.3
Emlyn Corrin created SPARK-24033: Summary: LAG Window function broken in Spark 2.3 Key: SPARK-24033 URL: https://issues.apache.org/jira/browse/SPARK-24033 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Emlyn Corrin The {{LAG}} window function appears to be broken in Spark 2.3.0, always failing with an AnalysisException. Interestingly, {{LEAD}} is not affected, so it can be worked around by negating the lag and using lead instead. Reproduction (run in {{spark-shell}}): {code:java} import org.apache.spark.sql.expressions.Window val ds = Seq((1,1),(1,2),(1,3),(2,1),(2,2)).toDF("n", "i") // The following works: ds.withColumn("m", lead("i", -1).over(Window.partitionBy("n").orderBy("i").rowsBetween(-1, -1))).show // The following (equivalent) fails: ds.withColumn("m", lag("i", 1).over(Window.partitionBy("n").orderBy("i").rowsBetween(-1, -1))).show {code} Here is the stacktrace: {quote} org.apache.spark.sql.AnalysisException: Window Frame specifiedwindowframe(RowFrame, -1, -1) must match the required frame specifiedwindowframe(RowFrame, -1, -1); at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:2034) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:2030) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:85) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:85) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:120) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:125) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:125) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:85) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:76) at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31.applyOrElse(Analyzer.scala:2030) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31.applyOrElse(Analyzer.scala:2029) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$.apply(Analyzer.scala:2029) at
[jira] [Updated] (SPARK-22823) Race Condition when reading Broadcast shuffle input. Failed to get broadcast piece
[ https://issues.apache.org/jira/browse/SPARK-22823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitrii Bundin updated SPARK-22823: --- Affects Version/s: 2.3.0 > Race Condition when reading Broadcast shuffle input. Failed to get broadcast > piece > -- > > Key: SPARK-22823 > URL: https://issues.apache.org/jira/browse/SPARK-22823 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 2.0.1, 2.2.1, 2.3.0 >Reporter: Dmitrii Bundin >Priority: Major > > It seems we have a race condition when trying to read shuffle input that is > broadcast, not direct. To read broadcast MapStatuses at > {code:java} > org.apache.spark.shuffle.BlockStoreShuffleReader::read() > {code} > we submit a message of the type GetMapOutputStatuses(shuffleId) to be > executed in MapOutputTrackerMaster's pool, which in turn ends up creating a > new broadcast at > {code:java} > org.apache.spark.MapOutputTracker::serializeMapStatuses > {code} > if the serialized statuses are larger than minBroadcastSize. > So registering the newly created broadcast in the driver's > BlockManagerMasterEndpoint may happen after some executor asks the driver for the > broadcast piece's location. 
> In our project we get the following exception on a regular basis: > {code:java} > java.io.IOException: org.apache.spark.SparkException: Failed to get > broadcast_176_piece0 of broadcast_176 > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1280) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:89) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:661) > at > org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:661) > at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54) > at > org.apache.spark.MapOutputTracker$.logInfo(MapOutputTracker.scala:598) > at > org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:660) > at > org.apache.spark.MapOutputTracker.getStatuses(MapOutputTracker.scala:203) > at > org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:142) > at > org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:109) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > 
at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Failed to get > broadcast_176_piece0 of broadcast_176 > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:146) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:125) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:125) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:125) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:186) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1273) > {code} > This exception appears when we try to read a broadcast piece. To do this > we need to fetch the broadcast piece location from the driver > {code:java} > org.apache.spark.storage.BlockManagerMaster::getLocations(blockId: BlockId) > {code} > The driver responds with an empty list of
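The failure path described above can be modeled with a short sketch. This is plain Python and a deliberately simplified stand-in for Spark's TorrentBroadcast / BlockManagerMaster interaction, not the actual implementation: when the driver no longer has any locations registered for a broadcast piece (for example, because the broadcast was destroyed or cleaned up), the read surfaces as the "Failed to get broadcast_176_piece0" exception quoted in the stack trace.

```python
# Toy model of the fetch path -- NOT Spark's real code. An executor asks the
# driver for the locations of a broadcast piece; if the driver returns an
# empty list, there is nowhere to fetch the piece from and the read fails.

class SparkException(Exception):
    pass

def get_locations(driver_registry, block_id):
    """Stand-in for BlockManagerMaster.getLocations(blockId)."""
    return driver_registry.get(block_id, [])

def read_broadcast_piece(driver_registry, block_id):
    locations = get_locations(driver_registry, block_id)
    if not locations:
        # Mirrors "Failed to get broadcast_176_piece0 of broadcast_176"
        raise SparkException(f"Failed to get {block_id}")
    return locations[0]  # in reality: fetch the block from this executor

# Driver responded with an empty location list -> the read fails
registry = {"broadcast_176_piece0": []}
try:
    read_broadcast_piece(registry, "broadcast_176_piece0")
except SparkException as e:
    print(e)  # Failed to get broadcast_176_piece0
```

The sketch only illustrates why an empty location list from the driver is fatal to the read; the retry, caching, and torrent-style piece distribution of the real TorrentBroadcast are omitted.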
[jira] [Resolved] (SPARK-24032) ElasticSearch update fails on nested field
[ https://issues.apache.org/jira/browse/SPARK-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24032. -- Resolution: Invalid I'd not open an issue unless it's clear. Let me leave this resolved for now, but please let me know if new facts come up. > ElasticSearch update fails on nested field > -- > > Key: SPARK-24032 > URL: https://issues.apache.org/jira/browse/SPARK-24032 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Cristina Luengo >Priority: Major > > I'm trying to update a nested field on ElasticSearch using a scripted update. > The code I'm using is: > update_params = "new_samples: samples" > update_script = "ctx._source.samples += new_samples" > es_conf = > { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": > "upsert", "es.update.script.params": update_params, > "es.update.script.inline": update_script } > result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"] > > ).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append') > And the schema of the field is: > |– samples: array (nullable = true)| > | |– element: struct (containsNull = true)| > | | |– gq: integer (nullable = true)| > | | |– dp: integer (nullable = true)| > | | |– gt: string (nullable = true)| > | | |– adBug: array (nullable = true)| > | | | |– element: integer (containsNull = true)| > | | |– ad: double (nullable = true)| > | | |– sample: string (nullable = true)| > And I get the following error: > py4j.protocol.Py4JJavaError: An error occurred while calling o83.save. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 > (TID 1, localhost, executor driver): java.lang.ClassCastException: > scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2 > Am I doing something wrong or is this a bug? Thanks.
[jira] [Commented] (SPARK-24032) ElasticSearch update fails on nested field
[ https://issues.apache.org/jira/browse/SPARK-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445522#comment-16445522 ] Cristina Luengo commented on SPARK-24032: - Thanks for the very fast reply! I wasn't sure whether it was fully an ES problem or a Spark one, but I also escalated the issue to ES. > ElasticSearch update fails on nested field > -- > > Key: SPARK-24032 > URL: https://issues.apache.org/jira/browse/SPARK-24032 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Cristina Luengo >Priority: Major > > I'm trying to update a nested field on ElasticSearch using a scripted update. > The code I'm using is: > update_params = "new_samples: samples" > update_script = "ctx._source.samples += new_samples" > es_conf = > { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": > "upsert", "es.update.script.params": update_params, > "es.update.script.inline": update_script } > result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"] > > ).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append') > And the schema of the field is: > |– samples: array (nullable = true)| > | |– element: struct (containsNull = true)| > | | |– gq: integer (nullable = true)| > | | |– dp: integer (nullable = true)| > | | |– gt: string (nullable = true)| > | | |– adBug: array (nullable = true)| > | | | |– element: integer (containsNull = true)| > | | |– ad: double (nullable = true)| > | | |– sample: string (nullable = true)| > And I get the following error: > py4j.protocol.Py4JJavaError: An error occurred while calling o83.save. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 > (TID 1, localhost, executor driver): java.lang.ClassCastException: > scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2 > Am I doing something wrong or is this a bug? Thanks.
[jira] [Commented] (SPARK-24032) ElasticSearch update fails on nested field
[ https://issues.apache.org/jira/browse/SPARK-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445519#comment-16445519 ] Hyukjin Kwon commented on SPARK-24032: -- Yea, doesn't that look like a problem from the elasticsearch third-party library? I'd resolve this JIRA on the Spark side. > ElasticSearch update fails on nested field > -- > > Key: SPARK-24032 > URL: https://issues.apache.org/jira/browse/SPARK-24032 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Cristina Luengo >Priority: Major > > I'm trying to update a nested field on ElasticSearch using a scripted update. > The code I'm using is: > update_params = "new_samples: samples" > update_script = "ctx._source.samples += new_samples" > es_conf = > { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": > "upsert", "es.update.script.params": update_params, > "es.update.script.inline": update_script } > result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"] > > ).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append') > And the schema of the field is: > |– samples: array (nullable = true)| > | |– element: struct (containsNull = true)| > | | |– gq: integer (nullable = true)| > | | |– dp: integer (nullable = true)| > | | |– gt: string (nullable = true)| > | | |– adBug: array (nullable = true)| > | | | |– element: integer (containsNull = true)| > | | |– ad: double (nullable = true)| > | | |– sample: string (nullable = true)| > And I get the following error: > py4j.protocol.Py4JJavaError: An error occurred while calling o83.save. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 > (TID 1, localhost, executor driver): java.lang.ClassCastException: > scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2 > Am I doing something wrong or is this a bug? Thanks.
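As a loose analogy (plain Python, not the elasticsearch-hadoop code; the field names are illustrative), the ClassCastException quoted above amounts to a writer that assumes every incoming value can be treated as a two-element (key, value) pair, while an array-of-structs column arrives as a plain sequence:

```python
# Loose analogy only -- mirrors the shape of the failure, not the real
# DataFrameValueWriter. The writer insists on a tuple (the Scala side casts
# to scala.Tuple2); an array-of-structs value arrives as a list, so the
# "cast" fails before anything is written to the index.

def write_field(value):
    if not isinstance(value, tuple):  # stand-in for the cast to Tuple2
        raise TypeError(f"{type(value).__name__} cannot be cast to tuple")
    key, val = value
    return f"{key}={val}"

print(write_field(("gq", 30)))  # a plain pair serializes fine: gq=30

samples = [{"gq": 30, "dp": 10}, {"gq": 25, "dp": 12}]  # array of structs
try:
    write_field(samples)
except TypeError as e:
    print(e)  # list cannot be cast to tuple
```

The point of the analogy is only that the mismatch is between the value's runtime shape and the shape the writer expects, which is why it fails inside the connector's serialization path rather than in Spark proper.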
[jira] [Commented] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side
[ https://issues.apache.org/jira/browse/SPARK-24025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445520#comment-16445520 ] Jacek Laskowski commented on SPARK-24025: - it seems related or duplicated > Join of bucketed and non-bucketed tables can give two exchanges and sorts for > non-bucketed side > --- > > Key: SPARK-24025 > URL: https://issues.apache.org/jira/browse/SPARK-24025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: {code:java} > ./bin/spark-shell --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.3.0 > /_/ > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162 > Branch master > Compiled by user sameera on 2018-02-22T19:24:29Z > Revision a0d7949896e70f427e7f3942ff340c9484ff0aab > Url g...@github.com:sameeragarwal/spark.git > Type --help for more information.{code} >Reporter: Jacek Laskowski >Priority: Major > Attachments: join-jira.png > > > While exploring bucketing I found the following join query of non-bucketed > and bucketed tables that ends up with two exchanges and two sorts in the > physical plan for the non-bucketed join side. 
> {code} > // Make sure that you don't end up with a BroadcastHashJoin and a > BroadcastExchange > // Disable auto broadcasting > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) > val bucketedTableName = "bucketed_4_id" > val large = spark.range(100) > large.write > .bucketBy(4, "id") > .sortBy("id") > .mode("overwrite") > .saveAsTable(bucketedTableName) > // Describe the table and include bucketing spec only > val descSQL = sql(s"DESC FORMATTED > $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === > "Sort Columns") > scala> descSQL.show(truncate = false) > +--+-+---+ > |col_name |data_type|comment| > +--+-+---+ > |Num Buckets |4| | > |Bucket Columns|[`id`] | | > |Sort Columns |[`id`] | | > +--+-+---+ > val bucketedTable = spark.table(bucketedTableName) > val t1 = spark.range(4) > .repartition(2, $"id") // Use just 2 partitions > .sortWithinPartitions("id") // sort partitions > val q = t1.join(bucketedTable, "id") > // Note two exchanges and sorts > scala> q.explain > == Physical Plan == > *(5) Project [id#79L] > +- *(5) SortMergeJoin [id#79L], [id#77L], Inner >:- *(3) Sort [id#79L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#79L, 4) >: +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0 >:+- Exchange hashpartitioning(id#79L, 2) >: +- *(1) Range (0, 4, step=1, splits=8) >+- *(4) Sort [id#77L ASC NULLS FIRST], false, 0 > +- *(4) Project [id#77L] > +- *(4) Filter isnotnull(id#77L) > +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct > q.foreach(_ => ()) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24032) ElasticSearch update fails on nested field
[ https://issues.apache.org/jira/browse/SPARK-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24032: - Priority: Major (was: Critical) > ElasticSearch update fails on nested field > -- > > Key: SPARK-24032 > URL: https://issues.apache.org/jira/browse/SPARK-24032 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Cristina Luengo >Priority: Major > > I'm trying to update a nested field on ElasticSearch using a scripted update. > The code I'm using is: > update_params = "new_samples: samples" > update_script = "ctx._source.samples += new_samples" > es_conf = > { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": > "upsert", "es.update.script.params": update_params, > "es.update.script.inline": update_script } > result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"] > > ).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append') > And the schema of the field is: > |– samples: array (nullable = true)| > | |– element: struct (containsNull = true)| > | | |– gq: integer (nullable = true)| > | | |– dp: integer (nullable = true)| > | | |– gt: string (nullable = true)| > | | |– adBug: array (nullable = true)| > | | | |– element: integer (containsNull = true)| > | | |– ad: double (nullable = true)| > | | |– sample: string (nullable = true)| > And I get the following error: > py4j.protocol.Py4JJavaError: An error occurred while calling o83.save. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 > (TID 1, localhost, executor driver): java.lang.ClassCastException: > scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2 > Am I doing something wrong or is this a bug? Thanks. 
[jira] [Commented] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side
[ https://issues.apache.org/jira/browse/SPARK-24025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445516#comment-16445516 ] Hyukjin Kwon commented on SPARK-24025: -- I haven't taken a close look yet but it should be good to just link it with "relates to" with a comment that it seems related or duplicated, for now. Further actions like investigation, testing or fixing would be even better, of course. We can change the link when better facts arrive :-). > Join of bucketed and non-bucketed tables can give two exchanges and sorts for > non-bucketed side > --- > > Key: SPARK-24025 > URL: https://issues.apache.org/jira/browse/SPARK-24025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: {code:java} > ./bin/spark-shell --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.3.0 > /_/ > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162 > Branch master > Compiled by user sameera on 2018-02-22T19:24:29Z > Revision a0d7949896e70f427e7f3942ff340c9484ff0aab > Url g...@github.com:sameeragarwal/spark.git > Type --help for more information.{code} >Reporter: Jacek Laskowski >Priority: Major > Attachments: join-jira.png > > > While exploring bucketing I found the following join query of non-bucketed > and bucketed tables that ends up with two exchanges and two sorts in the > physical plan for the non-bucketed join side. 
> {code} > // Make sure that you don't end up with a BroadcastHashJoin and a > BroadcastExchange > // Disable auto broadcasting > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) > val bucketedTableName = "bucketed_4_id" > val large = spark.range(100) > large.write > .bucketBy(4, "id") > .sortBy("id") > .mode("overwrite") > .saveAsTable(bucketedTableName) > // Describe the table and include bucketing spec only > val descSQL = sql(s"DESC FORMATTED > $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === > "Sort Columns") > scala> descSQL.show(truncate = false) > +--+-+---+ > |col_name |data_type|comment| > +--+-+---+ > |Num Buckets |4| | > |Bucket Columns|[`id`] | | > |Sort Columns |[`id`] | | > +--+-+---+ > val bucketedTable = spark.table(bucketedTableName) > val t1 = spark.range(4) > .repartition(2, $"id") // Use just 2 partitions > .sortWithinPartitions("id") // sort partitions > val q = t1.join(bucketedTable, "id") > // Note two exchanges and sorts > scala> q.explain > == Physical Plan == > *(5) Project [id#79L] > +- *(5) SortMergeJoin [id#79L], [id#77L], Inner >:- *(3) Sort [id#79L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#79L, 4) >: +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0 >:+- Exchange hashpartitioning(id#79L, 2) >: +- *(1) Range (0, 4, step=1, splits=8) >+- *(4) Sort [id#77L ASC NULLS FIRST], false, 0 > +- *(4) Project [id#77L] > +- *(4) Filter isnotnull(id#77L) > +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct > q.foreach(_ => ()) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24032) ElasticSearch update fails on nested field
[ https://issues.apache.org/jira/browse/SPARK-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445514#comment-16445514 ] Cristina Luengo commented on SPARK-24032: - Ok thanks!! Sorry, I didn't really know what to assign it! The full stack trace is: Traceback (most recent call last): File "/home/cluengo/Documents/Projects/vcfLoader/python/rdconnect/main.py", line 151, in main(hc,sqlContext) File "/home/cluengo/Documents/Projects/vcfLoader/python/rdconnect/main.py", line 141, in main variantsRN.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"] ).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append') File "/home/cluengo/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 595, in save File "/home/cluengo/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/home/cluengo/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/home/cluengo/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o181.save. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 41.0 failed 1 times, most recent failure: Lost task 0.0 in stage 41.0 (TID 200, localhost, executor driver): java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2 at org.elasticsearch.spark.sql.DataFrameValueWriter.write(DataFrameValueWriter.scala:53) at org.elasticsearch.hadoop.serialization.bulk.AbstractBulkFactory$FieldWriter.doWrite(AbstractBulkFactory.java:152) at org.elasticsearch.hadoop.serialization.bulk.AbstractBulkFactory$FieldWriter.write(AbstractBulkFactory.java:118) at org.elasticsearch.hadoop.serialization.bulk.TemplatedBulk.writeTemplate(TemplatedBulk.java:80) at org.elasticsearch.hadoop.serialization.bulk.TemplatedBulk.write(TemplatedBulk.java:56) at org.elasticsearch.hadoop.rest.RestRepository.writeToIndex(RestRepository.java:168) at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:67) at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:101) at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:101) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2075) at org.elasticsearch.spark.sql.EsSparkSQL$.saveToEs(EsSparkSQL.scala:101) at org.elasticsearch.spark.sql.ElasticsearchRelation.insert(DefaultSource.scala:610) at org.elasticsearch.spark.sql.DefaultSource.createRelation(DefaultSource.scala:103) at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472) at
[jira] [Commented] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side
[ https://issues.apache.org/jira/browse/SPARK-24025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445512#comment-16445512 ] Jacek Laskowski commented on SPARK-24025: - I was about to close this as a duplicate, but I'm not so sure anymore. Reading the following in SPARK-17570: {quote}If the number of buckets in the output table is a factor of the buckets in the input table, we should be able to avoid `Sort` and `Exchange` and directly join those. {quote} I'm no longer so sure that the two issues are perfectly identical, but they're close "relatives". I think the next step would be to write a test case to reproduce this issue, and the fairly simple fix would be a physical optimization (rule) that would eliminate the extra Exchange and Sort ops. Who'd help me here? > Join of bucketed and non-bucketed tables can give two exchanges and sorts for > non-bucketed side > --- > > Key: SPARK-24025 > URL: https://issues.apache.org/jira/browse/SPARK-24025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: {code:java} > ./bin/spark-shell --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.3.0 > /_/ > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162 > Branch master > Compiled by user sameera on 2018-02-22T19:24:29Z > Revision a0d7949896e70f427e7f3942ff340c9484ff0aab > Url g...@github.com:sameeragarwal/spark.git > Type --help for more information.{code} >Reporter: Jacek Laskowski >Priority: Major > Attachments: join-jira.png > > > While exploring bucketing I found the following join query of non-bucketed > and bucketed tables that ends up with two exchanges and two sorts in the > physical plan for the non-bucketed join side. 
> {code} > // Make sure that you don't end up with a BroadcastHashJoin and a > BroadcastExchange > // Disable auto broadcasting > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) > val bucketedTableName = "bucketed_4_id" > val large = spark.range(100) > large.write > .bucketBy(4, "id") > .sortBy("id") > .mode("overwrite") > .saveAsTable(bucketedTableName) > // Describe the table and include bucketing spec only > val descSQL = sql(s"DESC FORMATTED > $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === > "Sort Columns") > scala> descSQL.show(truncate = false) > +--+-+---+ > |col_name |data_type|comment| > +--+-+---+ > |Num Buckets |4| | > |Bucket Columns|[`id`] | | > |Sort Columns |[`id`] | | > +--+-+---+ > val bucketedTable = spark.table(bucketedTableName) > val t1 = spark.range(4) > .repartition(2, $"id") // Use just 2 partitions > .sortWithinPartitions("id") // sort partitions > val q = t1.join(bucketedTable, "id") > // Note two exchanges and sorts > scala> q.explain > == Physical Plan == > *(5) Project [id#79L] > +- *(5) SortMergeJoin [id#79L], [id#77L], Inner >:- *(3) Sort [id#79L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#79L, 4) >: +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0 >:+- Exchange hashpartitioning(id#79L, 2) >: +- *(1) Range (0, 4, step=1, splits=8) >+- *(4) Sort [id#77L ASC NULLS FIRST], false, 0 > +- *(4) Project [id#77L] > +- *(4) Filter isnotnull(id#77L) > +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct > q.foreach(_ => ()) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
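The condition quoted from SPARK-17570 can be sketched as a tiny predicate. This is a hypothetical helper, not part of Spark's API: the extra exchange/sort pair on one side should in principle be avoidable when that side's partition count is a factor of the other side's bucket count, because every coarser bucket is then a union of whole finer buckets.

```python
# Sketch of the SPARK-17570 condition -- a hypothetical helper, not a Spark
# API. When one bucket/partition count is a factor of the other, rows that
# hash into a coarse bucket come from whole fine buckets, so no reshuffle
# is needed to align the two sides of the join.

def can_skip_exchange(side_partitions: int, other_buckets: int) -> bool:
    """True if side_partitions is a factor of other_buckets."""
    return side_partitions > 0 and other_buckets % side_partitions == 0

# In the plan above the bucketed side has 4 buckets and the repartitioned
# side has 2 partitions: 2 is a factor of 4, so one Exchange is redundant.
print(can_skip_exchange(2, 4))  # True
print(can_skip_exchange(3, 4))  # False
```

Whether Spark's planner actually exploits this for the plan above is exactly the open question in this thread; the predicate only states when such an optimization is possible in principle.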
[jira] [Commented] (SPARK-24032) ElasticSearch update fails on nested field
[ https://issues.apache.org/jira/browse/SPARK-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445510#comment-16445510 ] Hyukjin Kwon commented on SPARK-24032: -- BTW, please avoid setting Critical+, which is usually reserved for committers. > ElasticSearch update fails on nested field > -- > > Key: SPARK-24032 > URL: https://issues.apache.org/jira/browse/SPARK-24032 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Cristina Luengo >Priority: Major > > I'm trying to update a nested field on ElasticSearch using a scripted update. > The code I'm using is: > update_params = "new_samples: samples" > update_script = "ctx._source.samples += new_samples" > es_conf = > { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": > "upsert", "es.update.script.params": update_params, > "es.update.script.inline": update_script } > result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"] > > ).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append') > And the schema of the field is: > |– samples: array (nullable = true)| > | |– element: struct (containsNull = true)| > | | |– gq: integer (nullable = true)| > | | |– dp: integer (nullable = true)| > | | |– gt: string (nullable = true)| > | | |– adBug: array (nullable = true)| > | | | |– element: integer (containsNull = true)| > | | |– ad: double (nullable = true)| > | | |– sample: string (nullable = true)| > And I get the following error: > py4j.protocol.Py4JJavaError: An error occurred while calling o83.save. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 > (TID 1, localhost, executor driver): java.lang.ClassCastException: > scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2 > Am I doing something wrong or is this a bug? Thanks.
[jira] [Commented] (SPARK-24032) ElasticSearch update fails on nested field
[ https://issues.apache.org/jira/browse/SPARK-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445507#comment-16445507 ] Hyukjin Kwon commented on SPARK-24032: -- Can you post the full stack trace? I doubt if it's a Spark issue. > ElasticSearch update fails on nested field > -- > > Key: SPARK-24032 > URL: https://issues.apache.org/jira/browse/SPARK-24032 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Cristina Luengo >Priority: Critical > > I'm trying to update a nested field on ElasticSearch using a scripted update. > The code I'm using is: > update_params = "new_samples: samples" > update_script = "ctx._source.samples += new_samples" > es_conf = > { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": > "upsert", "es.update.script.params": update_params, > "es.update.script.inline": update_script } > result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"] > > ).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append') > And the schema of the field is: > |– samples: array (nullable = true)| > | |– element: struct (containsNull = true)| > | | |– gq: integer (nullable = true)| > | | |– dp: integer (nullable = true)| > | | |– gt: string (nullable = true)| > | | |– adBug: array (nullable = true)| > | | | |– element: integer (containsNull = true)| > | | |– ad: double (nullable = true)| > | | |– sample: string (nullable = true)| > And I get the following error: > py4j.protocol.Py4JJavaError: An error occurred while calling o83.save. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 > (TID 1, localhost, executor driver): java.lang.ClassCastException: > scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2 > Am I doing something wrong or is this a bug? Thanks.
[jira] [Commented] (SPARK-16854) mapWithState Support for Python
[ https://issues.apache.org/jira/browse/SPARK-16854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445503#comment-16445503 ] Fatma Ali commented on SPARK-16854: --- +1 > mapWithState Support for Python > --- > > Key: SPARK-16854 > URL: https://issues.apache.org/jira/browse/SPARK-16854 > Project: Spark > Issue Type: Task > Components: PySpark >Affects Versions: 1.6.2, 2.0.0 >Reporter: Boaz >Priority: Major >
[jira] [Updated] (SPARK-24032) ElasticSearch update fails on nested field
[ https://issues.apache.org/jira/browse/SPARK-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cristina Luengo updated SPARK-24032:
Description:
I'm trying to update a nested field on ElasticSearch using a scripted update. The code I'm using is:
update_params = "new_samples: samples"
update_script = "ctx._source.samples += new_samples"
es_conf = { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": "upsert", "es.update.script.params": update_params, "es.update.script.inline": update_script }
result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"]).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append')
And the schema of the field is:
|-- samples: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- gq: integer (nullable = true)
|    |    |-- dp: integer (nullable = true)
|    |    |-- gt: string (nullable = true)
|    |    |-- adBug: array (nullable = true)
|    |    |    |-- element: integer (containsNull = true)
|    |    |-- ad: double (nullable = true)
|    |    |-- sample: string (nullable = true)
And I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o83.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2
Am I doing something wrong or is this a bug? Thanks.

was:
I'm trying to update a nested field using Spark and a scripted update. The code I'm using is:
update_params = "new_samples: samples"
update_script = "ctx._source.samples += new_samples"
es_conf = { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": "upsert", "es.update.script.params": update_params, "es.update.script.inline": update_script }
result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"]).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append')
And the schema of the field is:
|-- samples: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- gq: integer (nullable = true)
|    |    |-- dp: integer (nullable = true)
|    |    |-- gt: string (nullable = true)
|    |    |-- adBug: array (nullable = true)
|    |    |    |-- element: integer (containsNull = true)
|    |    |-- ad: double (nullable = true)
|    |    |-- sample: string (nullable = true)
And I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o83.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2
Am I doing something wrong or is this a bug? Thanks.

> ElasticSearch update fails on nested field
> --
>
> Key: SPARK-24032
> URL: https://issues.apache.org/jira/browse/SPARK-24032
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Cristina Luengo
>Priority: Critical
>
> I'm trying to update a nested field on ElasticSearch using a scripted update.
> The code I'm using is:
> update_params = "new_samples: samples"
> update_script = "ctx._source.samples += new_samples"
> es_conf = { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": "upsert", "es.update.script.params": update_params, "es.update.script.inline": update_script }
> result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"]).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append')
> And the schema of the field is:
> |-- samples: array (nullable = true)
> |    |-- element: struct (containsNull = true)
> |    |    |-- gq: integer (nullable = true)
> |    |    |-- dp: integer (nullable = true)
> |    |    |-- gt: string (nullable = true)
> |    |    |-- adBug: array (nullable = true)
> |    |    |    |-- element: integer (containsNull = true)
> |    |    |-- ad: double (nullable = true)
> |    |    |-- sample: string (nullable = true)
> And I get the following error:
> py4j.protocol.Py4JJavaError: An error occurred while calling o83.save.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0
[jira] [Created] (SPARK-24032) ElasticSearch updata fails on nested field
Cristina Luengo created SPARK-24032: --- Summary: ElasticSearch updata fails on nested field Key: SPARK-24032 URL: https://issues.apache.org/jira/browse/SPARK-24032 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Cristina Luengo I'm trying to update a nested field using Spark and a scripted update. The code I'm using is: update_params = "new_samples: samples" update_script = "ctx._source.samples += new_samples" es_conf = { "es.mapping.id": "id", "es.mapping.exclude": "id", "es.write.operation": "upsert", "es.update.script.params": update_params, "es.update.script.inline": update_script } result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"] ).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append') And the schema of the field is: |-- samples: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- gq: integer (nullable = true) | | |-- dp: integer (nullable = true) | | |-- gt: string (nullable = true) | | |-- adBug: array (nullable = true) | | | |-- element: integer (containsNull = true) | | |-- ad: double (nullable = true) | | |-- sample: string (nullable = true) And I get the following error: py4j.protocol.Py4JJavaError: An error occurred while calling o83.save. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2 Am I doing something wrong or is this a bug? Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24032) ElasticSearch update fails on nested field
[ https://issues.apache.org/jira/browse/SPARK-24032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cristina Luengo updated SPARK-24032: Summary: ElasticSearch update fails on nested field (was: ElasticSearch updata fails on nested field) > ElasticSearch update fails on nested field > -- > > Key: SPARK-24032 > URL: https://issues.apache.org/jira/browse/SPARK-24032 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Cristina Luengo >Priority: Critical > > I'm trying to update a nested field using Spark and a scripted update. The > code I'm using is: > update_params = "new_samples: samples" > update_script = "ctx._source.samples += new_samples" > es_conf = { > "es.mapping.id": "id", > "es.mapping.exclude": "id", > "es.write.operation": "upsert", > "es.update.script.params": update_params, > "es.update.script.inline": update_script > } > result.write.format("org.elasticsearch.spark.sql").options(**es_conf).option("es.nodes",configuration["elasticsearch"]["host"]).option("es.port",configuration["elasticsearch"]["port"] > > ).save(configuration["elasticsearch"]["index_name"]+"/"+configuration["version"],mode='append') > And the schema of the field is: > |-- samples: array (nullable = true) > | |-- element: struct (containsNull = true) > | | |-- gq: integer (nullable = true) > | | |-- dp: integer (nullable = true) > | | |-- gt: string (nullable = true) > | | |-- adBug: array (nullable = true) > | | | |-- element: integer (containsNull = true) > | | |-- ad: double (nullable = true) > | | |-- sample: string (nullable = true) > And I get the following error: > py4j.protocol.Py4JJavaError: An error occurred while calling o83.save. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 > (TID 1, localhost, executor driver): java.lang.ClassCastException: > scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.Tuple2 > Am I doing something wrong or is this a bug? Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
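For readers following the SPARK-24032 report above, the write configuration it describes boils down to a plain Python dict of elasticsearch-hadoop options. This is a sketch assembled from the reporter's own snippet (the `es.*` keys, the script strings, and the `result` DataFrame are the reporter's; the actual `.write` call, shown commented out, would need a SparkSession and the elasticsearch-hadoop connector on the classpath):

```python
# Scripted-upsert options from the report, as a self-contained dict.
# "es.update.script.params" binds a script variable to a document field;
# "es.update.script.inline" is the update script the cluster executes.
update_params = "new_samples: samples"
update_script = "ctx._source.samples += new_samples"

es_conf = {
    "es.mapping.id": "id",        # use the "id" column as the document _id
    "es.mapping.exclude": "id",   # but do not duplicate it inside _source
    "es.write.operation": "upsert",
    "es.update.script.params": update_params,
    "es.update.script.inline": update_script,
}

# With a live cluster, the reporter's write would then be roughly:
# (result.write.format("org.elasticsearch.spark.sql")
#        .options(**es_conf)
#        .option("es.nodes", host)
#        .option("es.port", port)
#        .save(index_name, mode="append"))
```

The reported `ClassCastException` on `WrappedArray$ofRef` fires only when the field bound by `es.update.script.params` is an array-of-struct column, which is what makes this look like a connector serialization bug rather than a misconfiguration.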
[jira] [Commented] (SPARK-13136) Data exchange (shuffle, broadcast) should only be handled by the exchange operator
[ https://issues.apache.org/jira/browse/SPARK-13136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445428#comment-16445428 ] Apache Spark commented on SPARK-13136:
-- User 'seancxmao' has created a pull request for this issue: https://github.com/apache/spark/pull/21113
> Data exchange (shuffle, broadcast) should only be handled by the exchange operator
> --
>
> Key: SPARK-13136
> URL: https://issues.apache.org/jira/browse/SPARK-13136
> Project: Spark
> Issue Type: Improvement
> Components: SQL
>Reporter: Reynold Xin
>Assignee: Herman van Hovell
>Priority: Major
> Fix For: 2.0.0
>
> In an ideal architecture, we have a very small number of physical operators that handle data exchanges, and the rest simply declare the input data distribution needed and let the planner inject the right Exchange operators. We have almost that, except for the following few operators:
> 1. Limit: does its own shuffle or collect to get data to a single partition.
> 2. Except: does its own shuffle; note that this operator is going away and will be replaced by anti-join (SPARK-12660).
> 3. Broadcast joins: broadcast joins do their own broadcast, which is a form of data exchange.
> Here is a straw man for limit. Split the current Limit operator into two: a partition-local limit and a terminal limit. Partition-local limit is just a normal unary operator. The terminal limit requires the input data distribution to be a single partition, and then takes its own limit. We then update the planner (strategies) to turn a logical limit into a partition-local limit and a terminal limit.
> For broadcast join, it is more involved. We would need to design the interface for the physical operators (e.g. we are no longer taking an iterator as input on the probe side), and allow Exchange to handle data broadcast.
> Note that this is an important step towards creating a clear delineation between distributed query execution and single-threaded query execution.
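The limit straw man quoted in SPARK-13136 above can be sketched outside Spark: first cap each partition independently, then bring everything to one place and cap once more. A toy Python illustration of the two-phase idea (not Spark code; function names are illustrative only):

```python
from itertools import islice

def partition_local_limit(partitions, n):
    # Phase 1: each partition independently keeps at most n rows.
    # In Spark this is a normal unary operator with no exchange.
    return [list(islice(iter(p), n)) for p in partitions]

def terminal_limit(partitions, n):
    # Phase 2: requires a single-partition distribution (here, a flat
    # concatenation stands in for the Exchange), then limits once more.
    merged = [row for p in partitions for row in p]
    return merged[:n]

partitions = [[1, 2, 3, 4], [5, 6], [7, 8, 9]]
locally_limited = partition_local_limit(partitions, 2)  # at most 2 rows per partition
result = terminal_limit(locally_limited, 2)             # final global limit
```

The point of the split is that phase 1 shrinks each partition before any data movement, so the single-partition exchange in phase 2 only ever ships at most `n * num_partitions` rows.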
[jira] [Commented] (SPARK-23588) Add interpreted execution for CatalystToExternalMap expression
[ https://issues.apache.org/jira/browse/SPARK-23588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445382#comment-16445382 ] Apache Spark commented on SPARK-23588: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/21112 > Add interpreted execution for CatalystToExternalMap expression > -- > > Key: SPARK-23588 > URL: https://issues.apache.org/jira/browse/SPARK-23588 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24031) the method of postTaskEnd should write once in handleTaskCompletion
[ https://issues.apache.org/jira/browse/SPARK-24031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445381#comment-16445381 ] Yu Wang commented on SPARK-24031: - @[~srowen] [~joshrosen] please have a look. > the method of postTaskEnd should write once in handleTaskCompletion > --- > > Key: SPARK-24031 > URL: https://issues.apache.org/jira/browse/SPARK-24031 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yu Wang >Priority: Major > Attachments: 24031_master_1.patch > > > the method of postTaskEnd should write once in handleTaskCompletion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-24031) the method of postTaskEnd should write once in handleTaskCompletion
[ https://issues.apache.org/jira/browse/SPARK-24031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Wang reopened SPARK-24031: - > the method of postTaskEnd should write once in handleTaskCompletion > --- > > Key: SPARK-24031 > URL: https://issues.apache.org/jira/browse/SPARK-24031 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yu Wang >Priority: Major > Attachments: 24031_master_1.patch > > > the method of postTaskEnd should write once in handleTaskCompletion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24031) the method of postTaskEnd should write once in handleTaskCompletion
[ https://issues.apache.org/jira/browse/SPARK-24031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Wang resolved SPARK-24031. - Resolution: Fixed > the method of postTaskEnd should write once in handleTaskCompletion > --- > > Key: SPARK-24031 > URL: https://issues.apache.org/jira/browse/SPARK-24031 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yu Wang >Priority: Major > Attachments: 24031_master_1.patch > > > the method of postTaskEnd should write once in handleTaskCompletion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24031) the method of postTaskEnd should write once in handleTaskCompletion
[ https://issues.apache.org/jira/browse/SPARK-24031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Wang updated SPARK-24031: Target Version/s: (was: 3.0.0) Fix Version/s: (was: 3.0.0) > the method of postTaskEnd should write once in handleTaskCompletion > --- > > Key: SPARK-24031 > URL: https://issues.apache.org/jira/browse/SPARK-24031 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yu Wang >Priority: Major > Attachments: 24031_master_1.patch > > > the method of postTaskEnd should write once in handleTaskCompletion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-23705) dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
[ https://issues.apache.org/jira/browse/SPARK-23705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Wang updated SPARK-23705: Comment: was deleted (was: [~khoatrantan2000] Could you assign this patch to me?) > dataframe.groupBy() may inadvertently receive sequence of non-distinct strings > -- > > Key: SPARK-23705 > URL: https://issues.apache.org/jira/browse/SPARK-23705 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Khoa Tran >Priority: Minor > Labels: beginner, easyfix, features, newbie, starter > Original Estimate: 1h > Remaining Estimate: 1h > > {code:java} > // code placeholder > package org.apache.spark.sql > . > . > . > class Dataset[T] private[sql]( > . > . > . > def groupBy(col1: String, cols: String*): RelationalGroupedDataset = { > val colNames: Seq[String] = col1 +: cols > RelationalGroupedDataset( > toDF(), colNames.map(colName => resolve(colName)), > RelationalGroupedDataset.GroupByType) > } > {code} > should append a `.distinct` after `colNames` when used in `groupBy` > > Not sure if the community agrees with this or it's up to the users to perform > the distinct operation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23705) dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
[ https://issues.apache.org/jira/browse/SPARK-23705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Wang updated SPARK-23705: Attachment: (was: SPARK-23705.patch) > dataframe.groupBy() may inadvertently receive sequence of non-distinct strings > -- > > Key: SPARK-23705 > URL: https://issues.apache.org/jira/browse/SPARK-23705 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Khoa Tran >Priority: Minor > Labels: beginner, easyfix, features, newbie, starter > Original Estimate: 1h > Remaining Estimate: 1h > > {code:java} > // code placeholder > package org.apache.spark.sql > . > . > . > class Dataset[T] private[sql]( > . > . > . > def groupBy(col1: String, cols: String*): RelationalGroupedDataset = { > val colNames: Seq[String] = col1 +: cols > RelationalGroupedDataset( > toDF(), colNames.map(colName => resolve(colName)), > RelationalGroupedDataset.GroupByType) > } > {code} > should append a `.distinct` after `colNames` when used in `groupBy` > > Not sure if the community agrees with this or it's up to the users to perform > the distinct operation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
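The fix proposed in the SPARK-23705 snippet quoted above, appending `.distinct` to `colNames`, amounts to an order-preserving de-duplication of the grouping column names (Scala's `Seq.distinct` keeps the first occurrence of each element). A Python sketch of that behavior, mirroring the `col1 +: cols` signature (illustrative only, not the actual Scala patch):

```python
def dedup_group_cols(col1, *cols):
    # Equivalent of `(col1 +: cols).distinct` in the quoted Scala:
    # keep the first occurrence of each column name, preserving order.
    seen = set()
    out = []
    for name in (col1, *cols):
        if name not in seen:
            seen.add(name)
            out.append(name)
    return out

# Duplicate names collapse; first-occurrence order is preserved.
print(dedup_group_cols("a", "b", "a", "c", "b"))  # -> ['a', 'b', 'c']
```

Whether this belongs in `groupBy` itself or stays the caller's responsibility is exactly the open question the reporter raises; grouping by a repeated column name is redundant but not incorrect.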
[jira] [Updated] (SPARK-24031) the method of postTaskEnd should write once in handleTaskCompletion
[ https://issues.apache.org/jira/browse/SPARK-24031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Wang updated SPARK-24031: Attachment: 24031_master_1.patch > the method of postTaskEnd should write once in handleTaskCompletion > --- > > Key: SPARK-24031 > URL: https://issues.apache.org/jira/browse/SPARK-24031 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yu Wang >Priority: Major > Fix For: 3.0.0 > > Attachments: 24031_master_1.patch > > > the method of postTaskEnd should write once in handleTaskCompletion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24031) the method of postTaskEnd should write once in handleTaskCompletion
Yu Wang created SPARK-24031: --- Summary: the method of postTaskEnd should write once in handleTaskCompletion Key: SPARK-24031 URL: https://issues.apache.org/jira/browse/SPARK-24031 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Yu Wang Fix For: 3.0.0 the method of postTaskEnd should write once in handleTaskCompletion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org