[jira] [Resolved] (SPARK-26826) Array indexing functions array_allpositions and array_select
[ https://issues.apache.org/jira/browse/SPARK-26826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petar Zecevic resolved SPARK-26826. --- Resolution: Won't Fix > Array indexing functions array_allpositions and array_select > > > Key: SPARK-26826 > URL: https://issues.apache.org/jira/browse/SPARK-26826 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 > Reporter: Petar Zecevic >Priority: Major > > This ticket proposes two extra array functions: {{array_allpositions}} (named > after {{array_position}}) and {{array_select}}. These functions should make > it easier to: > * get an array of indices of all occurrences of a value in an array > ({{array_allpositions}}) > * select all elements of an array based on an array of indices > ({{array_select}}) > Although higher-order functions, such as {{aggregate}} and {{transform}}, > have been recently added, performing the tasks above is still not simple, hence > this addition. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
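For context, the two proposed operations can be approximated today with the higher-order functions already available in Spark 2.4 SQL (filter, transform, sequence, element_at). The sketch below is only an illustration (the temporary view t and the looked-up value 5 are made up), but the relative verbosity is the gap the proposed functions were meant to close:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("array-indexing-workaround").getOrCreate()
    spark.sql("SELECT array(1, 5, 3, 5) AS arr").createOrReplaceTempView("t")

    // Approximation of the proposed array_allpositions(arr, 5):
    // 1-based indices of all occurrences of the value 5.
    spark.sql(
      "SELECT filter(sequence(1, size(arr)), i -> element_at(arr, i) = 5) AS positions FROM t"
    ).show() // [2, 4]

    // Approximation of the proposed array_select(arr, array(2, 4)):
    // the elements found at the given 1-based indices.
    spark.sql(
      "SELECT transform(array(2, 4), i -> element_at(arr, i)) AS selected FROM t"
    ).show() // [5, 5]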
[jira] [Updated] (SPARK-26826) Array indexing functions array_allpositions and array_select
[ https://issues.apache.org/jira/browse/SPARK-26826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petar Zecevic updated SPARK-26826: -- Description: This ticket proposes two extra array functions: {{array_allpositions}} (named after {{array_position}}) and {{array_select}}. These functions should make it easier to: * get an array of indices of all occurrences of a value in an array ({{array_allpositions}}) * select all elements of an array based on an array of indices ({{array_select}}) Although higher-order functions, such as {{aggregate}} and {{transform}}, have been recently added, performing the tasks above is still not simple, hence this addition. was: This ticket proposes two extra array functions: `array_allpositions` (named after `array_position`) and `array_select`. These functions should make it easier to: * get an array of indices of all occurrences of a value in an array (`array_allpositions`) * select all elements of an array based on an array of indices (`array_select`) Although higher-order functions, such as `aggregate` and `transform`, have been recently added, performing the tasks above is still not simple, hence this addition. > Array indexing functions array_allpositions and array_select > > > Key: SPARK-26826 > URL: https://issues.apache.org/jira/browse/SPARK-26826 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 > Reporter: Petar Zecevic >Priority: Major > > This ticket proposes two extra array functions: {{array_allpositions}} (named > after {{array_position}}) and {{array_select}}. These functions should make > it easier to: > * get an array of indices of all occurrences of a value in an array > ({{array_allpositions}}) > * select all elements of an array based on an array of indices > ({{array_select}}) > Although higher-order functions, such as {{aggregate}} and {{transform}}, > have been recently added, performing the tasks above is still not simple, hence > this addition. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26826) Array indexing functions array_allpositions and array_select
Petar Zecevic created SPARK-26826: - Summary: Array indexing functions array_allpositions and array_select Key: SPARK-26826 URL: https://issues.apache.org/jira/browse/SPARK-26826 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.0 Reporter: Petar Zecevic This ticket proposes two extra array functions: `array_allpositions` (named after `array_position`) and `array_select`. These functions should make it easier to: * get an array of indices of all occurrences of a value in an array (`array_allpositions`) * select all elements of an array based on an array of indices (`array_select`) Although higher-order functions, such as `aggregate` and `transform`, have been recently added, performing the tasks above is still not simple, hence this addition. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: Jenkins build errors
The problem was with the changes upstream. Fetching upstream and rebasing resolved it, and now the build is passing. I also added a design doc and made the JIRA description a bit clearer (https://issues.apache.org/jira/browse/SPARK-24020) so I hope it will get merged soon. Thanks, Petar Sean Owen @ 1970-01-01 01:00 CET: > Also confused about this one as many builds succeed. One possible difference > is that this failure is in the Hive tests, so are you building and testing > with -Phive locally where it works? still does not explain the download > failure. It could be a mirror > problem, throttling, etc. But there again haven't spotted another failing > Hive test. > > On Wed, Jun 20, 2018 at 1:55 AM Petar Zecevic wrote: > > It's still dying. Back to this error (it used to be spark-2.2.0 before): > > java.io.IOException: Cannot run program "./bin/spark-submit" (in directory > "/tmp/test-spark/spark-2.1.2"): error=2, No such file or directory > So, a mirror is missing that Spark version... I don't understand why nobody > else has these errors and I get them every time without fail. > > Petar - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
[jira] [Updated] (SPARK-24020) Sort-merge join inner range optimization
[ https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petar Zecevic updated SPARK-24020: -- Description: The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted within partitions by Y column and you need to calculate an expensive function on the joined rows, which reduces the number of output rows (e.g. condition based on a spatial distance calculation). But you could theoretically reduce the number of joined rows for which the calculation itself is performed by using a range condition on the Y column. Something like this: {{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }} However, during a sort-merge join with this range condition specified, Spark will first cross-join all the rows with the same X value and only then try to apply the range condition and any function calculations. This happens because, inside the generated sort-merge join (SMJ) code, these extra conditions are put in the same block with the function being calculated and there is no way to evaluate these conditions before reading all the rows to be checked into memory (into an {{ExternalAppendOnlyUnsafeRowArray}}). If the two tables have a large number of rows per X, this can result in a huge number of calculations and a huge number of rows in executor memory, which can be unfeasible. h3. The solution implementation We therefore propose a change to the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window through the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join; in the literature it is sometimes called an “epsilon join”. We call it a "sort-merge inner range join". This design uses much less memory (not all rows with the same values of X need to be loaded into memory at once) and requires a much lower number of comparisons (the validity of this statement depends on the actual data and conditions used). h3. The classes that need to be changed For implementing the described change we propose changes to these classes: * _ExtractEquiJoinKeys_ – a pattern that needs to be extended to be able to recognize the case where a simple range condition with lower and upper limits is used on a secondary column (a column not included in the equi-join condition). The pattern also needs to extract the information later required for code generation etc. * _InMemoryUnsafeRowQueue_ – the moving window implementation to be used instead of the _ExternalAppendOnlyUnsafeRowArray_ class. The rows need to be added to and removed from the structure as the left key (X) changes, or the left secondary value (Y) changes, so the structure needs to be a queue. To make the change as unintrusive as possible, we propose to implement _InMemoryUnsafeRowQueue_ as a subclass of _ExternalAppendOnlyUnsafeRowArray_ * _JoinSelection_ – a strategy that uses _ExtractEquiJoinKeys_ and needs to be aware of the extracted range conditions * _SortMergeJoinExec_ – the main implementation of the optimization. 
Needs to support two code paths: ** when whole-stage code generation is turned off (method doExecute, which uses sortMergeJoinInnerRangeScanner) ** when whole-stage code generation is turned on (methods doProduce and genScanner) * _SortMergeJoinInnerRangeScanner_ – implements the SMJ with inner-range optimization in the case when whole-stage codegen is turned off * _InnerJoinSuite_ – functional tests * _JoinBenchmark_ – performance tests h3. Triggering the optimization The optimization should be triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts should be added. To limit the impact of this change we also propose adding a new parameter (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which could be used to switch off the optimization entirely. h2. Applicable use cases Potential use-cases for this are joins based on spatial or temporal distance calculations. was: The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted within partitions by Y column and you need to calculate an expensive function on the joined rows, which reduces the number of output rows (e.g. condition based on a spatial distance calculation). But you could theoretically reduce the number of joined rows for which the calculation itself is performed by using a range condition on the Y column. Something like this: {{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }} However, during a sort-merge join with this range
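To make the moving-window idea concrete, here is a greatly simplified, hedged Scala sketch of the merge loop. It is not the actual SortMergeJoinExec / InMemoryUnsafeRowQueue code; the Row case class, the driver loop and the tolerance d are illustrative assumptions. Both inputs are assumed sorted by (x, y), and the queue plays the role of the proposed InMemoryUnsafeRowQueue:

    import scala.collection.mutable

    case class Row(x: Int, y: Double)

    // Joins rows with equal x whose y values differ by at most d, in a single pass
    // over both (x, y)-sorted inputs, keeping only the current window in memory.
    def innerRangeJoin(left: Seq[Row], right: Seq[Row], d: Double): Seq[(Row, Row)] = {
      val out = mutable.ArrayBuffer.empty[(Row, Row)]
      val window = mutable.Queue.empty[Row] // stands in for InMemoryUnsafeRowQueue
      var ri = 0
      for (l <- left) {
        // Evict right rows that can never match this or any later left row:
        // a different key, or y below the lower bound l.y - d.
        while (window.nonEmpty && (window.head.x != l.x || window.head.y < l.y - d))
          window.dequeue()
        // Advance the right side up to the upper bound l.y + d, enqueueing matches.
        while (ri < right.length &&
               (right(ri).x < l.x || (right(ri).x == l.x && right(ri).y <= l.y + d))) {
          if (right(ri).x == l.x && right(ri).y >= l.y - d) window.enqueue(right(ri))
          ri += 1
        }
        // Every row left in the window lies inside [l.y - d, l.y + d] for the same x.
        window.foreach(r => out += ((l, r)))
      }
      out
    }

    // Example: only pairs sharing the key and within a y-distance of 1.0 are produced.
    innerRangeJoin(
      left  = Seq(Row(1, 1.0), Row(1, 5.0), Row(2, 2.0)),
      right = Seq(Row(1, 0.5), Row(1, 4.0), Row(2, 9.0)),
      d = 1.0
    ) // -> Seq((Row(1,1.0), Row(1,0.5)), (Row(1,5.0), Row(1,4.0)))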
[jira] [Updated] (SPARK-24020) Sort-merge join inner range optimization
[ https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petar Zecevic updated SPARK-24020: -- Description: The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted within partitions by Y column and you need to calculate an expensive function on the joined rows, which reduces the number of output rows (e.g. condition based on a spatial distance calculation). But you could theoretically reduce the number of joined rows for which the calculation itself is performed by using a range condition on the Y column. Something like this: {{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }} However, during a sort-merge join with this range condition specified, Spark will first cross-join all the rows with the same X value and only then try to apply the range condition and any function calculations. This happens because, inside the generated sort-merge join (SMJ) code, these extra conditions are put in the same block with the function being calculated and there is no way to evaluate these conditions before reading all the rows to be checked into memory (into an {{ExternalAppendOnlyUnsafeRowArray}}). If the two tables have a large number of rows per X, this can result in a huge number of calculations and a huge number of rows in executor memory, which can be unfeasible. h3. The solution implementation We therefore propose a change to the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window through the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join; in the literature it is sometimes called an “epsilon join”. We call it a "sort-merge inner range join". This design uses much less memory (not all rows with the same values of X need to be loaded into memory at once) and requires a much lower number of comparisons (the validity of this statement depends on the actual data and conditions used). h3. The classes that need to be changed For implementing the described change we propose changes to these classes: * _ExtractEquiJoinKeys_ – a pattern that needs to be extended to be able to recognize the case where a simple range condition with lower and upper limits is used on a secondary column (a column not included in the equi-join condition). The pattern also needs to extract the information later required for code generation etc. * _InMemoryUnsafeRowQueue_ – the moving window implementation to be used instead of the _ExternalAppendOnlyUnsafeRowArray_ class. The rows need to be added to and removed from the structure as the left key (X) changes, or the left secondary value (Y) changes, so the structure needs to be a queue. To make the change as unintrusive as possible, we propose to implement _InMemoryUnsafeRowQueue_ as a subclass of _ExternalAppendOnlyUnsafeRowArray_ * _JoinSelection_ – a strategy that uses _ExtractEquiJoinKeys_ and needs to be aware of the extracted range conditions * _SortMergeJoinExec_ – the main implementation of the optimization. 
Needs to support two code paths: ** when whole-stage code generation is turned off (method doExecute, which uses sortMergeJoinInnerRangeScanner) ** when whole-stage code generation is turned on (methods doProduce and genScanner) * _SortMergeJoinInnerRangeScanner_ – implements the SMJ with inner-range optimization in the case when whole-stage codegen is turned off * _InnerJoinSuite_ – functional tests * _JoinBenchmark_ – performance tests h3. Triggering the optimization The optimization should be triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts should be added. To limit the impact of this change we also propose adding a new parameter (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which could be used to switch off the optimization entirely. h3. Applicable use cases Potential use-cases for this are joins based on spatial or temporal distance calculations. was: The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted within partitions by Y column and you need to calculate an expensive function on the joined rows, which reduces the number of output rows (e.g. condition based on a spatial distance calculation). But you could theoretically reduce the number of joined rows for which the calculation itself is performed by using a range condition on the Y column. Something like this: {{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }} However, during a sort-merge join with this range
[jira] [Updated] (SPARK-24020) Sort-merge join inner range optimization
[ https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petar Zecevic updated SPARK-24020: -- Description: The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted within partitions by Y column and you need to calculate an expensive function on the joined rows, which reduces the number of output rows (e.g. condition based on a spatial distance calculation). But you could theoretically reduce the number of joined rows for which the calculation itself is performed by using a range condition on the Y column. Something like this: {{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }} However, during a sort-merge join with this range condition specified, Spark will first cross-join all the rows with the same X value and only then try to apply the range condition and any function calculations. This happens because, inside the generated sort-merge join (SMJ) code, these extra conditions are put in the same block with the function being calculated and there is no way to evaluate these conditions before reading all the rows to be checked into memory (into an {{ExternalAppendOnlyUnsafeRowArray}}). If the two tables have a large number of rows per X, this can result in a huge number of calculations and a huge number of rows in executor memory, which can be unfeasible. h2. The solution implementation We therefore propose a change to the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window through the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join; in the literature it is sometimes called an “epsilon join”. We call it a "sort-merge inner range join". This design uses much less memory (not all rows with the same values of X need to be loaded into memory at once) and requires a much lower number of comparisons (the validity of this statement depends on the actual data and conditions used). h3. The classes that need to be changed For implementing the described change we propose changes to these classes: * _ExtractEquiJoinKeys_ – a pattern that needs to be extended to be able to recognize the case where a simple range condition with lower and upper limits is used on a secondary column (a column not included in the equi-join condition). The pattern also needs to extract the information later required for code generation etc. * _InMemoryUnsafeRowQueue_ – the moving window implementation to be used instead of the _ExternalAppendOnlyUnsafeRowArray_ class. The rows need to be added to and removed from the structure as the left key (X) changes, or the left secondary value (Y) changes, so the structure needs to be a queue. To make the change as unintrusive as possible, we propose to implement _InMemoryUnsafeRowQueue_ as a subclass of _ExternalAppendOnlyUnsafeRowArray_ * _JoinSelection_ – a strategy that uses _ExtractEquiJoinKeys_ and needs to be aware of the extracted range conditions * _SortMergeJoinExec_ – the main implementation of the optimization. 
Needs to support two code paths: ** when whole-stage code generation is turned off (method doExecute, which uses sortMergeJoinInnerRangeScanner) ** when whole-stage code generation is turned on (methods doProduce and genScanner) * _SortMergeJoinInnerRangeScanner_ – implements the SMJ with inner-range optimization in the case when whole-stage codegen is turned off * _InnerJoinSuite_ – functional tests * _JoinBenchmark_ – performance tests h2. Triggering the optimization The optimization should be triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts should be added. To limit the impact of this change we also propose adding a new parameter (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which could be used to switch off the optimization entirely. h2. Applicable use cases Potential use-cases for this are joins based on spatial or temporal distance calculations. was: The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted within partitions by Y column and you need to calculate an expensive function on the joined rows, which reduces the number of output rows (e.g. condition based on a spatial distance calculation). But you could theoretically reduce the number of joined rows for which the calculation itself is performed by using a range condition on the Y column. Something like this: {{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }} However, during a sort-merge join with this range
[jira] [Updated] (SPARK-24020) Sort-merge join inner range optimization
[ https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petar Zecevic updated SPARK-24020: -- Description: The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted within partitions by Y column and you need to calculate an expensive function on the joined rows, which reduces the number of output rows (e.g. condition based on a spatial distance calculation). But you could theoretically reduce the number of joined rows for which the calculation itself is performed by using a range condition on the Y column. Something like this: {{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }} However, during a sort-merge join with this range condition specified, Spark will first cross-join all the rows with the same X value and only then try to apply the range condition and any function calculations. This happens because, inside the generated sort-merge join (SMJ) code, these extra conditions are put in the same block with the function being calculated and there is no way to evaluate these conditions before reading all the rows to be checked into memory (into an {{ExternalAppendOnlyUnsafeRowArray}}). If the two tables have a large number of rows per X, this can result in a huge number of calculations and a huge number of rows in executor memory, which can be unfeasible. h2. The solution implementation We therefore propose a change to the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window through the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join; in the literature it is sometimes called an “epsilon join”. We call it a "sort-merge inner range join". This design uses much less memory (not all rows with the same values of X need to be loaded into memory at once) and requires a much lower number of comparisons (the validity of this statement depends on the actual data and conditions used). h3. For implementing the described change we propose changes to these classes: * _ExtractEquiJoinKeys_ – a pattern that needs to be extended to be able to recognize the case where a simple range condition with lower and upper limits is used on a secondary column (a column not included in the equi-join condition). The pattern also needs to extract the information later required for code generation etc. * _InMemoryUnsafeRowQueue_ – the moving window implementation to be used instead of the _ExternalAppendOnlyUnsafeRowArray_ class. The rows need to be added to and removed from the structure as the left key (X) changes, or the left secondary value (Y) changes, so the structure needs to be a queue. To make the change as unintrusive as possible, we propose to implement _InMemoryUnsafeRowQueue_ as a subclass of _ExternalAppendOnlyUnsafeRowArray_ * _JoinSelection_ – a strategy that uses _ExtractEquiJoinKeys_ and needs to be aware of the extracted range conditions * _SortMergeJoinExec_ – the main implementation of the optimization. 
Needs to support two code paths: ** when whole-stage code generation is turned off (method doExecute, which uses sortMergeJoinInnerRangeScanner) ** when whole-stage code generation is turned on (methods doProduce and genScanner) * _SortMergeJoinInnerRangeScanner_ – implements the SMJ with inner-range optimization in the case when whole-stage codegen is turned off * _InnerJoinSuite_ – functional tests * _JoinBenchmark_ – performance tests The optimization should be triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts should be added. To limit the impact of this change we also propose adding a new parameter (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which could be used to switch off the optimization entirely. Potential use-cases for this are joins based on spatial or temporal distance calculations. was: The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted by Y column (within partitions) and you need to calculate an expensive function on the joined rows. During a sort-merge join, Spark will do cross-joins of all rows that have the same X values and calculate the function's value on all of them. If the two tables have a large number of rows per X, this can result in a huge number of calculations. We hereby propose an optimization that would allow you to reduce the number of matching rows per X using a range condition on Y columns of the two tables. Something like: ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d The
[jira] [Updated] (SPARK-24020) Sort-merge join inner range optimization
[ https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petar Zecevic updated SPARK-24020: -- Attachment: SMJ-innerRange-PR24020-designDoc.pdf > Sort-merge join inner range optimization > > > Key: SPARK-24020 > URL: https://issues.apache.org/jira/browse/SPARK-24020 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 > Reporter: Petar Zecevic >Priority: Major > Attachments: SMJ-innerRange-PR24020-designDoc.pdf > > > The problem we are solving is the case where you have two big tables > partitioned by X column, but also sorted by Y column (within partitions) and > you need to calculate an expensive function on the joined rows. During a > sort-merge join, Spark will do cross-joins of all rows that have the same X > values and calculate the function's value on all of them. If the two tables > have a large number of rows per X, this can result in a huge number of > calculations. > We hereby propose an optimization that would allow you to reduce the number > of matching rows per X using a range condition on Y columns of the two > tables. Something like: > ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d > The way SMJ is currently implemented, these extra conditions have no > influence on the number of rows (per X) being checked because these extra > conditions are put in the same block with the function being calculated. > Here we propose a change to the sort-merge join so that, when these extra > conditions are specified, a queue is used instead of the > ExternalAppendOnlyUnsafeRowArray class. This queue would then be used as a > moving window across the values from the right relation as the left row > changes. You could call this a combination of an equi-join and a theta join > (we call it "sort-merge inner range join"). > Potential use-cases for this are joins based on spatial or temporal distance > calculations. > The optimization should be triggered automatically when an equi-join > expression is present AND lower and upper range conditions on a secondary > column are specified. If the tables aren't sorted by both columns, > appropriate sorts should be added. > To limit the impact of this change we also propose adding a new parameter > (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which > could be used to switch off the optimization entirely. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
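As a usage sketch, the query shape the ticket targets looks like the following. Note that the configuration key is only the tentative name proposed above (it does not exist in any released Spark version), and the tables, columns and the cost_fn UDF are hypothetical placeholders:

    // Tentative switch proposed in this ticket; NOT an existing Spark option.
    spark.conf.set("spark.sql.join.smj.useInnerRangeOptimization", "true")

    // Hypothetical temporal-distance join: device_id is the equi-join key (X),
    // ts carries the secondary range condition (Y), and cost_fn is an expensive UDF
    // that would otherwise be evaluated for every pair of rows sharing device_id.
    val joined = spark.sql("""
      SELECT t1.id, t2.id, cost_fn(t1.payload, t2.payload) AS score
      FROM   events1 t1
      JOIN   events2 t2
        ON   t1.device_id = t2.device_id
       AND   t1.ts BETWEEN t2.ts - 30 AND t2.ts + 30
    """)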
Re: Jenkins build errors
It's still dying. Back to this error (it used to be spark-2.2.0 before): java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.1.2"): error=2, No such file or directory So, a mirror is missing that Spark version... I don't understand why nobody else has these errors and I get them every time without fail. Petar Le 6/19/2018 à 2:35 PM, Sean Owen a écrit : Those still appear to be env problems. I don't know why it is so persistent. Does it all pass locally? Retrigger tests again and see what happens. On Tue, Jun 19, 2018, 2:53 AM Petar Zecevic <mailto:petar.zece...@gmail.com>> wrote: Thanks, but unfortunately, it died again. Now at pyspark tests: Running PySpark tests Running PySpark tests. Output is in /home/jenkins/workspace/SparkPullRequestBuilder@2/python/unit-tests.log Will test against the following Python executables: ['python2.7', 'python3.4', 'pypy'] Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming'] Will skip PyArrow related features against Python executable 'python2.7' in 'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not found. Will skip Pandas related features against Python executable 'python2.7' in 'pyspark-sql' module. Pandas >= 0.19.2 is required; however, Pandas 0.16.0 was found. Will test PyArrow related features against Python executable 'python3.4' in 'pyspark-sql' module. Will test Pandas related features against Python executable 'python3.4' in 'pyspark-sql' module. Will skip PyArrow related features against Python executable 'pypy' in 'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not found. Will skip Pandas related features against Python executable 'pypy' in 'pyspark-sql' module. Pandas >= 0.19.2 is required; however, Pandas was not found. Starting test(python2.7): pyspark.mllib.tests Starting test(pypy): pyspark.sql.tests Starting test(pypy): pyspark.streaming.tests Starting test(pypy): pyspark.tests Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). ... [Stage 0:> (0 + 1) / 1] .. [Stage 0:> (0 + 4) / 4] . [Stage 0:> (0 + 4) / 4] .. [Stage 0:> (0 + 4) / 4] [Stage 0:> (0 + 4) / 4] [Stage 0:> (0 + 4) / 4] [Stage 0:> (0 + 4) / 4] [Stage 0:>(0 + 32) / 32]... [Stage 10:> (0 + 1) / 1] . [Stage 0:> (0 + 4) / 4] .s [Stage 0:> (0 + 1) / 1] . [Stage 0:> (0 + 4) / 4] [Stage 0:==>(1 + 3) / 4] . [Stage 0:> (0 + 4) / 4] .. [Stage 0:> (0 + 2) / 2]
Re: Jenkins build errors
/0.1/mylib-0.1.pom -- artifact a#mylib;0.1!mylib.jar: https://repo1.maven.org/maven2/a/mylib/0.1/mylib-0.1.jar spark-packages: tried http://dl.bintray.com/spark-packages/maven/a/mylib/0.1/mylib-0.1.pom -- artifact a#mylib;0.1!mylib.jar: http://dl.bintray.com/spark-packages/maven/a/mylib/0.1/mylib-0.1.jar repo-1: tried file:/tmp/tmpgO7AIY/a/mylib/0.1/mylib-0.1.pom :: :: UNRESOLVED DEPENDENCIES :: :: :: a#mylib;0.1: not found :: :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: a#mylib;0.1: not found] at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1268) at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:49) at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:348) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:170) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Ffile:/tmp/tmpwtN2z_ added as a remote repository with the name: repo-1 Ivy Default Cache set to: /home/jenkins/.ivy2/cache The jars for the packages stored in: /home/jenkins/.ivy2/jars :: loading settings :: url = jar:file:/home/jenkins/workspace/SparkPullRequestBuilder@2/assembly/target/scala-2.11/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml a#mylib added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default] found a#mylib;0.1 in repo-1 :: resolution report :: resolve 1378ms :: artifacts dl 4ms :: modules in use: a#mylib;0.1 from repo-1 in [default] - | |modules|| artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| - | default | 1 | 1 | 1 | 0 || 1 | 0 | - :: retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0 artifacts copied, 1 already retrieved (0kB/8ms) . [Stage 0:> (0 + 4) / 4] . [Stage 0:> (0 + 1) / 1] . [Stage 0:> (0 + 1) / 1] ... [Stage 0:> (0 + 4) / 20] [Stage 0:=>(6 + 4) / 20] .. == FAIL: test_package_dependency (pyspark.tests.SparkSubmitTests) Submit and test a script with a dependency on a Spark Package -- Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", line 2093, in test_package_dependency self.assertEqual(0, proc.returncode) AssertionError: 0 != 1 -- Ran 127 tests in 205.547s FAILED (failures=1, skipped=2) NOTE: Skipping SciPy tests as it does not seem to be installed NOTE: Skipping NumPy tests as it does not seem to be installed Random listing order was used Had test failures in pyspark.tests with pypy; see logs. [error] running /home/jenkins/workspace/SparkPullRequestBuilder@2/python/run-tests --parallelism=4 ; received return code 255 Attempting to post to Github... > Post successful. Build step 'Execute shell' marked build as failure Archiving artifacts Recording test results Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92038/ Test FAILed. Finished: FAILURE Le 6/18/2018 à 8:05 PM, shane knapp a écrit : i triggered another build against your PR, so let's see if this happens again or was a transient failure. https://amplab.cs.berkeley.edu/jenkins/jo
Jenkins build errors
Hi, Jenkins build for my PR (https://github.com/apache/spark/pull/21109 ; https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92023/testReport/org.apache.spark.sql.hive/HiveExternalCatalogVersionsSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/) keeps failing. First it couldn't download Spark v.2.2.0 (indeed, it wasn't available at the mirror it selected), now it's failing with this exception below. Can someone explain these errors for me? Is anybody else experiencing similar problems? Thanks, Petar Error Message java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.2.1"): error=2, No such file or directory Stacktrace sbt.ForkMain$ForkError: java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.2.1"): error=2, No such file or directory at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) at org.apache.spark.sql.hive.SparkSubmitTestUtils$class.runSparkSubmit(SparkSubmitTestUtils.scala:73) at org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.runSparkSubmit(HiveExternalCatalogVersionsSuite.scala:43) at org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:176) at org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:161) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.beforeAll(HiveExternalCatalogVersionsSuite.scala:161) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) at sbt.ForkMain$Run$2.call(ForkMain.java:296) at sbt.ForkMain$Run$2.call(ForkMain.java:286) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: sbt.ForkMain$ForkError: java.io.IOException: error=2, No such file or directory at java.lang.UNIXProcess.forkAndExec(Native Method) at java.lang.UNIXProcess.(UNIXProcess.java:248) at java.lang.ProcessImpl.start(ProcessImpl.java:134) at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) ... 17 more
Re: Sort-merge join improvement
Hi, we went through a round of reviews on this PR. Performance improvements can be substantial and there are unit and performance tests included. One remark was that the amount of changed code is large but I don't see how to reduce it and still keep the performance improvements. Besides, all the new code is well contained in separate classes (unless it was necessary to change existing ones). So I believe this is ready to be merged. Can some of the committers please take another look at this and accept the PR? Thank you, Petar Zecevic Le 5/15/2018 à 10:55 AM, Petar Zecevic a écrit : Based on some reviews I put additional effort into fixing the case when wholestage codegen is turned off. Sort-merge join with additional range conditions is now 10x faster (can be more or less, depending on exact use-case) in both cases - with wholestage turned off or on - compared to non-optimized SMJ. Merging this would help us tremendously and I believe this can be useful in other applications, too. Can you please review (https://github.com/apache/spark/pull/21109) and merge the patch? Thank you, Petar Zecevic Le 4/23/2018 à 6:28 PM, Petar Zecevic a écrit : Hi, the PR tests completed successfully (https://github.com/apache/spark/pull/21109). Can you please review the patch and merge it upstream if you think it's OK? Thanks, Petar Le 4/18/2018 à 4:52 PM, Petar Zecevic a écrit : As instructed offline, I opened a JIRA for this: https://issues.apache.org/jira/browse/SPARK-24020 I will create a pull request soon. Le 4/17/2018 à 6:21 PM, Petar Zecevic a écrit : Hello everybody We (at University of Zagreb and University of Washington) have implemented an optimization of Spark's sort-merge join (SMJ) which has improved performance of our jobs considerably and we would like to know if Spark community thinks it would be useful to include this in the main distribution. The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted by Y column (within partitions) and you need to calculate an expensive function on the joined rows. During a sort-merge join, Spark will do cross-joins of all rows that have the same X values and calculate the function's value on all of them. If the two tables have a large number of rows per X, this can result in a huge number of calculations. Our optimization allows you to reduce the number of matching rows per X using a range condition on Y columns of the two tables. Something like: ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d The way SMJ is currently implemented, these extra conditions have no influence on the number of rows (per X) being checked because these extra conditions are put in the same block with the function being calculated. Our optimization changes the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window across the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join (we call it "sort-merge inner range join"). Potential use-cases for this are joins based on spatial or temporal distance calculations. The optimization is triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts will be added. We have several questions: 1. 
Do you see any other way to optimize queries like these (eliminate unnecessary calculations) without changing the sort-merge join algorithm? 2. We believe there is a more general pattern here and that this could help in other similar situations where secondary sorting is available. Would you agree? 3. Would you like us to open a JIRA ticket and create a pull request? Thanks, Petar Zecevic - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: Sort-merge join improvement
Based on some reviews I put additional effort into fixing the case when wholestage codegen is turned off. Sort-merge join with additional range conditions is now 10x faster (can be more or less, depending on exact use-case) in both cases - with wholestage turned off or on - compared to non-optimized SMJ. Merging this would help us tremendously and I believe this can be useful in other applications, too. Can you please review (https://github.com/apache/spark/pull/21109) and merge the patch? Thank you, Petar Zecevic Le 4/23/2018 à 6:28 PM, Petar Zecevic a écrit : Hi, the PR tests completed successfully (https://github.com/apache/spark/pull/21109). Can you please review the patch and merge it upstream if you think it's OK? Thanks, Petar Le 4/18/2018 à 4:52 PM, Petar Zecevic a écrit : As instructed offline, I opened a JIRA for this: https://issues.apache.org/jira/browse/SPARK-24020 I will create a pull request soon. Le 4/17/2018 à 6:21 PM, Petar Zecevic a écrit : Hello everybody We (at University of Zagreb and University of Washington) have implemented an optimization of Spark's sort-merge join (SMJ) which has improved performance of our jobs considerably and we would like to know if Spark community thinks it would be useful to include this in the main distribution. The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted by Y column (within partitions) and you need to calculate an expensive function on the joined rows. During a sort-merge join, Spark will do cross-joins of all rows that have the same X values and calculate the function's value on all of them. If the two tables have a large number of rows per X, this can result in a huge number of calculations. Our optimization allows you to reduce the number of matching rows per X using a range condition on Y columns of the two tables. Something like: ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d The way SMJ is currently implemented, these extra conditions have no influence on the number of rows (per X) being checked because these extra conditions are put in the same block with the function being calculated. Our optimization changes the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window across the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join (we call it "sort-merge inner range join"). Potential use-cases for this are joins based on spatial or temporal distance calculations. The optimization is triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts will be added. We have several questions: 1. Do you see any other way to optimize queries like these (eliminate unnecessary calculations) without changing the sort-merge join algorithm? 2. We believe there is a more general pattern here and that this could help in other similar situations where secondary sorting is available. Would you agree? 3. Would you like us to open a JIRA ticket and create a pull request? Thanks, Petar Zecevic - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: Sort-merge join improvement
Hi, the PR tests completed successfully (https://github.com/apache/spark/pull/21109). Can you please review the patch and merge it upstream if you think it's OK? Thanks, Petar Le 4/18/2018 à 4:52 PM, Petar Zecevic a écrit : As instructed offline, I opened a JIRA for this: https://issues.apache.org/jira/browse/SPARK-24020 I will create a pull request soon. Le 4/17/2018 à 6:21 PM, Petar Zecevic a écrit : Hello everybody We (at University of Zagreb and University of Washington) have implemented an optimization of Spark's sort-merge join (SMJ) which has improved performance of our jobs considerably and we would like to know if Spark community thinks it would be useful to include this in the main distribution. The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted by Y column (within partitions) and you need to calculate an expensive function on the joined rows. During a sort-merge join, Spark will do cross-joins of all rows that have the same X values and calculate the function's value on all of them. If the two tables have a large number of rows per X, this can result in a huge number of calculations. Our optimization allows you to reduce the number of matching rows per X using a range condition on Y columns of the two tables. Something like: ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d The way SMJ is currently implemented, these extra conditions have no influence on the number of rows (per X) being checked because these extra conditions are put in the same block with the function being calculated. Our optimization changes the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window across the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join (we call it "sort-merge inner range join"). Potential use-cases for this are joins based on spatial or temporal distance calculations. The optimization is triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts will be added. We have several questions: 1. Do you see any other way to optimize queries like these (eliminate unnecessary calculations) without changing the sort-merge join algorithm? 2. We believe there is a more general pattern here and that this could help in other similar situations where secondary sorting is available. Would you agree? 3. Would you like us to open a JIRA ticket and create a pull request? Thanks, Petar Zecevic - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
[jira] [Commented] (SPARK-24020) Sort-merge join inner range optimization
[ https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444335#comment-16444335 ] Petar Zecevic commented on SPARK-24020: --- No, this implementation only applies to equi-joins that have range conditions on different columns. You can think of it as an equi-join with "sub-band" conditions. Hence the name we gave it ("sort-merge inner range join"). > Sort-merge join inner range optimization > > > Key: SPARK-24020 > URL: https://issues.apache.org/jira/browse/SPARK-24020 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Petar Zecevic >Priority: Major > > The problem we are solving is the case where you have two big tables > partitioned by X column, but also sorted by Y column (within partitions) and > you need to calculate an expensive function on the joined rows. During a > sort-merge join, Spark will do cross-joins of all rows that have the same X > values and calculate the function's value on all of them. If the two tables > have a large number of rows per X, this can result in a huge number of > calculations. > We hereby propose an optimization that would allow you to reduce the number > of matching rows per X using a range condition on Y columns of the two > tables. Something like: > ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d > The way SMJ is currently implemented, these extra conditions have no > influence on the number of rows (per X) being checked because these extra > conditions are put in the same block with the function being calculated. > Here we propose a change to the sort-merge join so that, when these extra > conditions are specified, a queue is used instead of the > ExternalAppendOnlyUnsafeRowArray class. This queue would then be used as a > moving window across the values from the right relation as the left row > changes. You could call this a combination of an equi-join and a theta join > (we call it "sort-merge inner range join"). > Potential use-cases for this are joins based on spatial or temporal distance > calculations. > The optimization should be triggered automatically when an equi-join > expression is present AND lower and upper range conditions on a secondary > column are specified. If the tables aren't sorted by both columns, > appropriate sorts should be added. > To limit the impact of this change we also propose adding a new parameter > (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which > could be used to switch off the optimization entirely. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
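To illustrate the distinction drawn in this comment (tables t1 and t2 are hypothetical): only the second query shape, an equi-join key plus a range condition on a different, secondary column, is what the optimization targets.

    // A plain band (theta) join on Y alone is NOT covered: there is no equi-join key.
    spark.sql("SELECT * FROM t1 JOIN t2 ON t1.Y BETWEEN t2.Y - 5 AND t2.Y + 5")

    // An equi-join on X plus a "sub-band" range condition on Y is the targeted case.
    spark.sql("SELECT * FROM t1 JOIN t2 ON t1.X = t2.X AND t1.Y BETWEEN t2.Y - 5 AND t2.Y + 5")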
Re: Sort-merge join improvement
As instructed offline, I opened a JIRA for this: https://issues.apache.org/jira/browse/SPARK-24020 I will create a pull request soon. Le 4/17/2018 à 6:21 PM, Petar Zecevic a écrit : Hello everybody We (at University of Zagreb and University of Washington) have implemented an optimization of Spark's sort-merge join (SMJ) which has improved performance of our jobs considerably and we would like to know if Spark community thinks it would be useful to include this in the main distribution. The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted by Y column (within partitions) and you need to calculate an expensive function on the joined rows. During a sort-merge join, Spark will do cross-joins of all rows that have the same X values and calculate the function's value on all of them. If the two tables have a large number of rows per X, this can result in a huge number of calculations. Our optimization allows you to reduce the number of matching rows per X using a range condition on Y columns of the two tables. Something like: ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d The way SMJ is currently implemented, these extra conditions have no influence on the number of rows (per X) being checked because these extra conditions are put in the same block with the function being calculated. Our optimization changes the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window across the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join (we call it "sort-merge inner range join"). Potential use-cases for this are joins based on spatial or temporal distance calculations. The optimization is triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts will be added. We have several questions: 1. Do you see any other way to optimize queries like these (eliminate unnecessary calculations) without changing the sort-merge join algorithm? 2. We believe there is a more general pattern here and that this could help in other similar situations where secondary sorting is available. Would you agree? 3. Would you like us to open a JIRA ticket and create a pull request? Thanks, Petar Zecevic - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
[jira] [Created] (SPARK-24020) Sort-merge join inner range optimization
Petar Zecevic created SPARK-24020: - Summary: Sort-merge join inner range optimization Key: SPARK-24020 URL: https://issues.apache.org/jira/browse/SPARK-24020 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Petar Zecevic The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted by Y column (within partitions) and you need to calculate an expensive function on the joined rows. During a sort-merge join, Spark will do cross-joins of all rows that have the same X values and calculate the function's value on all of them. If the two tables have a large number of rows per X, this can result in a huge number of calculations. We hereby propose an optimization that would allow you to reduce the number of matching rows per X using a range condition on Y columns of the two tables. Something like: ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d The way SMJ is currently implemented, these extra conditions have no influence on the number of rows (per X) being checked because these extra conditions are put in the same block with the function being calculated. Here we propose a change to the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue would then be used as a moving window across the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join (we call it "sort-merge inner range join"). Potential use-cases for this are joins based on spatial or temporal distance calculations. The optimization should be triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts should be added. To limit the impact of this change we also propose adding a new parameter (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which could be used to switch off the optimization entirely. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Sort-merge join improvement
Hello everybody We (at University of Zagreb and University of Washington) have implemented an optimization of Spark's sort-merge join (SMJ) which has improved performance of our jobs considerably and we would like to know if Spark community thinks it would be useful to include this in the main distribution. The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted by Y column (within partitions) and you need to calculate an expensive function on the joined rows. During a sort-merge join, Spark will do cross-joins of all rows that have the same X values and calculate the function's value on all of them. If the two tables have a large number of rows per X, this can result in a huge number of calculations. Our optimization allows you to reduce the number of matching rows per X using a range condition on Y columns of the two tables. Something like: ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d The way SMJ is currently implemented, these extra conditions have no influence on the number of rows (per X) being checked because these extra conditions are put in the same block with the function being calculated. Our optimization changes the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window across the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join (we call it "sort-merge inner range join"). Potential use-cases for this are joins based on spatial or temporal distance calculations. The optimization is triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts will be added. We have several questions: 1. Do you see any other way to optimize queries like these (eliminate unnecessary calculations) without changing the sort-merge join algorithm? 2. We believe there is a more general pattern here and that this could help in other similar situations where secondary sorting is available. Would you agree? 3. Would you like us to open a JIRA ticket and create a pull request? Thanks, Petar Zecevic - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
[jira] [Commented] (SPARK-13313) Strongly connected components doesn't find all strongly connected components
[ https://issues.apache.org/jira/browse/SPARK-13313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200028#comment-15200028 ] Petar Zecevic commented on SPARK-13313: --- Ok, thanks for reporting. I'll look into this. > Strongly connected components doesn't find all strongly connected components > > > Key: SPARK-13313 > URL: https://issues.apache.org/jira/browse/SPARK-13313 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.6.0 > Reporter: Petar Zecevic > > Strongly connected components algorithm doesn't find all strongly connected > components. I was using Wikispeedia dataset > (http://snap.stanford.edu/data/wikispeedia.html) and the algorithm found 519 > SCCs and one of them had 4051 vertices, which in reality don't have any edges > between them. > I think the problem could be on line 89 of StronglyConnectedComponents.scala > file where EdgeDirection.In should be changed to EdgeDirection.Out. I believe > the second Pregel call should use Out edge direction, the same as the first > call because the direction is reversed in the provided sendMsg function > (message is sent to source vertex and not destination vertex). > If that is changed (line 89), the algorithm starts finding much more SCCs, > but eventually stack overflow exception occurs. I believe graph objects that > are changed through iterations should not be cached, but checkpointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13313) Strongly connected components doesn't find all strongly connected components
[ https://issues.apache.org/jira/browse/SPARK-13313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147007#comment-15147007 ] Petar Zecevic commented on SPARK-13313: --- No, I don't think it's got anything to do with that. That largest SCC's vertices are not connected in any way and they shouldn't be in the same group. > Strongly connected components doesn't find all strongly connected components > > > Key: SPARK-13313 > URL: https://issues.apache.org/jira/browse/SPARK-13313 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.6.0 > Reporter: Petar Zecevic > > Strongly connected components algorithm doesn't find all strongly connected > components. I was using Wikispeedia dataset > (http://snap.stanford.edu/data/wikispeedia.html) and the algorithm found 519 > SCCs and one of them had 4051 vertices, which in reality don't have any edges > between them. > I think the problem could be on line 89 of StronglyConnectedComponents.scala > file where EdgeDirection.In should be changed to EdgeDirection.Out. I believe > the second Pregel call should use Out edge direction, the same as the first > call because the direction is reversed in the provided sendMsg function > (message is sent to source vertex and not destination vertex). > If that is changed (line 89), the algorithm starts finding much more SCCs, > but eventually stack overflow exception occurs. I believe graph objects that > are changed through iterations should not be cached, but checkpointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13313) Strongly connected components doesn't find all strongly connected components
[ https://issues.apache.org/jira/browse/SPARK-13313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146731#comment-15146731 ] Petar Zecevic commented on SPARK-13313: --- Yes, you need articles.tsv and links.tsv from this archive: http://snap.stanford.edu/data/wikispeedia/wikispeedia_paths-and-graph.tar.gz Then parse the data, assign IDs to article names and create the graph: val articles = sc.textFile("articles.tsv", 6).filter(line => line.trim() != "" && !line.startsWith("#")).zipWithIndex().cache() val links = sc.textFile("links.tsv", 6).filter(line => line.trim() != "" && !line.startsWith("#")) val linkIndexes = links.map(x => { val spl = x.split("\t"); (spl(0), spl(1)) }).join(articles).map(x => x._2).join(articles).map(x => x._2) val wikigraph = Graph.fromEdgeTuples(linkIndexes, 0) Then get strongly connected components: val wikiSCC = wikigraph.stronglyConnectedComponents(100) wikiSCC graph contains 519 SCCs, but there should be much more. The largest SCC in wikiSCC has 4051 vertices and that's obviously wrong. The change in line 89, which I mentioned, seems to solve this problem, but then other issues arise (stack overflow etc) and I don't have time to investigate further. I hope someone will look into this. > Strongly connected components doesn't find all strongly connected components > > > Key: SPARK-13313 > URL: https://issues.apache.org/jira/browse/SPARK-13313 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.6.0 >Reporter: Petar Zecevic > > Strongly connected components algorithm doesn't find all strongly connected > components. I was using Wikispeedia dataset > (http://snap.stanford.edu/data/wikispeedia.html) and the algorithm found 519 > SCCs and one of them had 4051 vertices, which in reality don't have any edges > between them. > I think the problem could be on line 89 of StronglyConnectedComponents.scala > file where EdgeDirection.In should be changed to EdgeDirection.Out. I believe > the second Pregel call should use Out edge direction, the same as the first > call because the direction is reversed in the provided sendMsg function > (message is sent to source vertex and not destination vertex). > If that is changed (line 89), the algorithm starts finding much more SCCs, > but eventually stack overflow exception occurs. I believe graph objects that > are changed through iterations should not be cached, but checkpointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13313) Strongly connected components doesn't find all strongly connected components
Petar Zecevic created SPARK-13313: - Summary: Strongly connected components doesn't find all strongly connected components Key: SPARK-13313 URL: https://issues.apache.org/jira/browse/SPARK-13313 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.6.0 Reporter: Petar Zecevic Strongly connected components algorithm doesn't find all strongly connected components. I was using Wikispeedia dataset (http://snap.stanford.edu/data/wikispeedia.html) and the algorithm found 519 SCCs and one of them had 4051 vertices, which in reality don't have any edges between them. I think the problem could be on line 89 of StronglyConnectedComponents.scala file where EdgeDirection.In should be changed to EdgeDirection.Out. I believe the second Pregel call should use Out edge direction, the same as the first call because the direction is reversed in the provided sendMsg function (message is sent to source vertex and not destination vertex). If that is changed (line 89), the algorithm starts finding much more SCCs, but eventually stack overflow exception occurs. I believe graph objects that are changed through iterations should not be cached, but checkpointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: Is spark suitable for real time query
You can try out a few tricks employed by folks at Lynx Analytics... Daniel Darabos gave some details at Spark Summit: https://www.youtube.com/watch?v=zt1LdVj76LU&index=13&list=PL-x35fyliRwhP52fwDqULJLOnqnrN5nDs On 22.7.2015. 17:00, Louis Hust wrote: My code is like below: Map<String, String> t11opt = new HashMap<String, String>(); t11opt.put("url", DB_URL); t11opt.put("dbtable", "t11"); DataFrame t11 = sqlContext.load("jdbc", t11opt); t11.registerTempTable("t11"); ...the same for t12, t21, t22 DataFrame t1 = t11.unionAll(t12); t1.registerTempTable("t1"); DataFrame t2 = t21.unionAll(t22); t2.registerTempTable("t2"); for (int i = 0; i < 10; i++) { System.out.println(new Date(System.currentTimeMillis())); DataFrame crossjoin = sqlContext.sql("select txt from t1 join t2 on t1.id = t2.id"); crossjoin.show(); System.out.println(new Date(System.currentTimeMillis())); } Where t11, t12, t21, t22 are all table DataFrames loaded over JDBC from a MySQL database that is local to the Spark job. But each loop executes for about 3 seconds and I do not know why it costs so much time. 2015-07-22 19:52 GMT+08:00 Robin East robin.e...@xense.co.uk: Here’s an example using spark-shell on my laptop: sc.textFile("LICENSE").filter(_ contains "Spark").count This takes less than a second the first time I run it and is instantaneous on every subsequent run. What code are you running? On 22 Jul 2015, at 12:34, Louis Hust louis.h...@gmail.com wrote: I did a simple test using Spark in standalone mode (not cluster), and found that a simple action takes a few seconds; the data size is small, just a few rows. So will each Spark job cost some time for init or prepare work no matter what the job is? I mean, will the basic framework of a Spark job cost seconds? 2015-07-22 19:17 GMT+08:00 Robin East robin.e...@xense.co.uk: Real-time is, of course, relative but you’ve mentioned microsecond level. Spark is designed to process large amounts of data in a distributed fashion. No distributed system I know of could give any kind of guarantees at the microsecond level. Robin On 22 Jul 2015, at 11:14, Louis Hust louis.h...@gmail.com wrote: Hi, all I am using a Spark jar in standalone mode, fetching data from different MySQL instances and doing some actions, but I found the time is at the second level. So I want to know if a Spark job is suitable for real-time queries at the microsecond level?
Re: Spark - Eclipse IDE - Maven
Sorry about self-promotion, but there's a really nice tutorial for setting up Eclipse for Spark in Spark in Action book: http://www.manning.com/bonaci/ On 27.7.2015. 10:22, Akhil Das wrote: You can follow this doc https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup Thanks Best Regards On Fri, Jul 24, 2015 at 10:56 AM, Siva Reddy ksiv...@gmail.com mailto:ksiv...@gmail.com wrote: Hi All, I am trying to setup the Eclipse (LUNA) with Maven so that I create a maven projects for developing spark programs. I am having some issues and I am not sure what is the issue. Can Anyone share a nice step-step document to configure eclipse with maven for spark development. Thanks Siva -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Eclipse-IDE-Maven-tp23977.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org mailto:user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org mailto:user-h...@spark.apache.org
Re: Spark - Eclipse IDE - Maven
Sorry about self-promotion, but there's a really nice tutorial for setting up Eclipse for Spark in Spark in Action book: http://www.manning.com/bonaci/ On 24.7.2015. 7:26, Siva Reddy wrote: Hi All, I am trying to setup the Eclipse (LUNA) with Maven so that I create a maven projects for developing spark programs. I am having some issues and I am not sure what is the issue. Can Anyone share a nice step-step document to configure eclipse with maven for spark development. Thanks Siva -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Eclipse-IDE-Maven-tp23977.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Fwd: Model weights of linear regression becomes abnormal values
You probably need to scale the values in the data set so that they are all of comparable ranges, and translate them so that their means become 0. You can use the pyspark.mllib.feature.StandardScaler(True, True) object for that. On 28.5.2015. 6:08, Maheshakya Wijewardena wrote: Hi, I'm trying to use Spark's *LinearRegressionWithSGD* in PySpark with the attached dataset. The code is attached. When I check the model weights vector after training, it contains `nan` values. [nan,nan,nan,nan,nan,nan,nan,nan] But for some data sets, this problem does not occur. What might be the reason for this? Is this an issue with the data I'm using or a bug? Best regards. -- Pruthuvi Maheshakya Wijewardena Software Engineer WSO2 Lanka (Pvt) Ltd Email: mahesha...@wso2.com Mobile: +94711228855 - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
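To make the suggestion concrete, here is a small sketch using the Scala MLlib API, whose StandardScaler mirrors the pyspark.mllib.feature one mentioned above; the toy data is made up for illustration.

  import org.apache.spark.mllib.feature.StandardScaler
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // Toy data with features on very different scales.
  val data = sc.parallelize(Seq(
    LabeledPoint(1.0, Vectors.dense(1000.0, 0.001)),
    LabeledPoint(2.0, Vectors.dense(2000.0, 0.002))))

  // withMean = true, withStd = true: shift each feature to mean 0 and scale to unit variance.
  val scaler = new StandardScaler(withMean = true, withStd = true).fit(data.map(_.features))
  val scaled = data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))

The scaled RDD can then be fed to LinearRegressionWithSGD instead of the raw data.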
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390206#comment-14390206 ] Petar Zecevic commented on SPARK-6646: -- Good one :) Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: How to configure SparkUI to use internal ec2 ip
Did you try setting the SPARK_MASTER_IP parameter in spark-env.sh? On 31.3.2015. 19:19, Anny Chen wrote: Hi Akhil, I tried editing the /etc/hosts on the master and on the workers, and seems it is not working for me. I tried adding hostname internal-ip and it didn't work. I then tried adding internal-ip hostname and it didn't work either. I guess I should also edit the spark-env.sh file? Thanks! Anny On Mon, Mar 30, 2015 at 11:15 PM, Akhil Das ak...@sigmoidanalytics.com mailto:ak...@sigmoidanalytics.com wrote: You can add an internal ip to public hostname mapping in your /etc/hosts file, if your forwarding is proper then it wouldn't be a problem there after. Thanks Best Regards On Tue, Mar 31, 2015 at 9:18 AM, anny9699 anny9...@gmail.com mailto:anny9...@gmail.com wrote: Hi, For security reasons, we added a server between my aws Spark Cluster and local, so I couldn't connect to the cluster directly. To see the SparkUI and its related work's stdout and stderr, I used dynamic forwarding and configured the SOCKS proxy. Now I could see the SparkUI using the internal ec2 ip, however when I click on the application UI (4040) or the worker's UI (8081), it still automatically uses the public DNS instead of internal ec2 ip, which the browser now couldn't show. Is there a way that I could configure this? I saw that one could configure the LOCAL_ADDRESS_IP in the spark-env.sh, but not sure whether this could help. Does anyone experience the same issue? Thanks a lot! Anny -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-configure-SparkUI-to-use-internal-ec2-ip-tp22311.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org mailto:user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org mailto:user-h...@spark.apache.org
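For example (the address below is just a placeholder for the instance's internal EC2 IP), something along these lines in conf/spark-env.sh on the master, followed by a restart of the Spark daemons:

  # conf/spark-env.sh on the master node (placeholder internal address)
  SPARK_MASTER_IP=172.31.0.10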
Re: Spark-submit and multiple files
I tried your program in yarn-client mode and it worked with no exception. This is the command I used: spark-submit --master yarn-client --py-files work.py main.py (Spark 1.2.1) On 20.3.2015. 9:47, Guillaume Charhon wrote: Hi Davies, I am already using --py-files. The system does use the other file. The error I am getting is not trivial. Please check the error log. On Thu, Mar 19, 2015 at 8:03 PM, Davies Liu dav...@databricks.com mailto:dav...@databricks.com wrote: You could submit additional Python source via --py-files , for example: $ bin/spark-submit --py-files work.py main.py On Tue, Mar 17, 2015 at 3:29 AM, poiuytrez guilla...@databerries.com mailto:guilla...@databerries.com wrote: Hello guys, I am having a hard time to understand how spark-submit behave with multiple files. I have created two code snippets. Each code snippet is composed of a main.py and work.py. The code works if I paste work.py then main.py in a pyspark shell. However both snippets do not work when using spark submit and generate different errors. Function add_1 definition outside http://www.codeshare.io/4ao8B https://justpaste.it/jzvj Embedded add_1 function definition http://www.codeshare.io/OQJxq https://justpaste.it/jzvn I am trying a way to make it work. Thank you for your support. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submit-and-multiple-files-tp22097.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org mailto:user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org mailto:user-h...@spark.apache.org
Re: Hamburg Apache Spark Meetup
Please add the Zagreb Meetup group, too. http://www.meetup.com/Apache-Spark-Zagreb-Meetup/ Thanks! On 18.2.2015. 19:46, Johan Beisser wrote: If you could also add the Hamburg Apache Spark Meetup, I'd appreciate it. http://www.meetup.com/Hamburg-Apache-Spark-Meetup/ On Tue, Feb 17, 2015 at 5:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Thanks! I've added you. Matei On Feb 17, 2015, at 4:06 PM, Ralph Bergmann | the4thFloor.eu ra...@the4thfloor.eu wrote: Hi, there is a small Spark Meetup group in Berlin, Germany :-) http://www.meetup.com/Berlin-Apache-Spark-Meetup/ Plaes add this group to the Meetups list at https://spark.apache.org/community.html Ralph - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell
I believe your class needs to be defined as a case class (as I answered on SO). On 25.2.2015. 5:15, anamika gupta wrote: Hi Akhil I guess it skipped my attention. I would definitely give it a try. While I would still like to know what is the issue with the way I have created the schema? On Tue, Feb 24, 2015 at 4:35 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you happen to have a look at https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema Thanks Best Regards On Tue, Feb 24, 2015 at 3:39 PM, anu anamika.guo...@gmail.com wrote: My issue is posted here on stack-overflow. What am I doing wrong here? http://stackoverflow.com/questions/28689186/facing-error-while-extending-scala-class-with-product-interface-to-overcome-limi View this message in context: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell http://apache-spark-user-list.1001560.n3.nabble.com/Facing-error-while-extending-scala-class-with-Product-interface-to-overcome-limit-of-22-fields-in-spl-tp21787.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
Re: Accumulator in SparkUI for streaming
Interesting. Accumulators are shown on the Web UI if you are using the ordinary SparkContext (Spark 1.2). It just has to be named (and that's what you did). scala> val acc = sc.accumulator(0, "test accumulator") acc: org.apache.spark.Accumulator[Int] = 0 scala> val rdd = sc.parallelize(1 to 1000) rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12 scala> rdd.foreach(x => acc += 1) scala> acc.value res1: Int = 1000 The Stage details page then shows the accumulator and its value. On 20.2.2015. 9:25, Tim Smith wrote: On Spark 1.2: I am trying to capture the number of records read from a Kafka topic: val inRecords = ssc.sparkContext.accumulator(0, "InRecords") .. kInStreams.foreach( k => { k.foreachRDD ( rdd => inRecords += rdd.count().toInt ) inRecords.value Question is how do I get the accumulator to show up in the UI? I tried inRecords.value but that didn't help. Pretty sure it isn't showing up in Stage metrics. What's the trick here? collect? Thanks, Tim
Re: Posting to the list
The message went through after all. Sorry for spamming. On 21.2.2015. 21:27, pzecevic wrote: Hi Spark users. Does anybody know what are the steps required to be able to post to this list by sending an email to user@spark.apache.org? I just sent a reply to Corey Nolet's mail Missing shuffle files but I don't think it was accepted by the engine. If I look at the Spark user list, I don't see this topic (Missing shuffle files) at all: http://apache-spark-user-list.1001560.n3.nabble.com/ I can see it in the archives, though: https://mail-archives.apache.org/mod_mbox/spark-user/201502.mbox/browser but my answer is not there. This is not the first time this happened and I am wondering what is going on. The engine is eating my emails? It doesn't like me? I am subscribed to the list and I have the Nabble account. I previously saw one of my email marked with This message has not been accepted by the mailing list yet. I read what that means, but I don't think it applies to me. What am I missing? P.S.: I am posting this through the Nabble web interface. Hope it gets through... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Posting-to-the-list-tp21750.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Missing shuffle files
Could you try to turn on the external shuffle service? spark.shuffle.service.enabled=true On 21.2.2015. 17:50, Corey Nolet wrote: I'm experiencing the same issue. Upon closer inspection I'm noticing that executors are being lost as well. Thing is, I can't figure out how they are dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3TB of memory allocated for the application. I was thinking perhaps it was possible that a single executor was getting a single or a couple large partitions but shouldn't the disk persistence kick in at that point? On Sat, Feb 21, 2015 at 11:20 AM, Anders Arpteg arp...@spotify.com wrote: For large jobs, the following error message is shown that seems to indicate that shuffle files for some reason are missing. It's a rather large job with many partitions. If the data size is reduced, the problem disappears. I'm running a build from Spark master post 1.2 (build at 2015-01-16) and running on Yarn 2.2. Any idea of how to resolve this problem? User class threw exception: Job aborted due to stage failure: Task 450 in stage 450.1 failed 4 times, most recent failure: Lost task 450.3 in stage 450.1 (TID 167370, lon4-hadoopslave-b77.lon4.spotify.net): java.io.FileNotFoundException: /disk/hd06/yarn/local/usercache/arpteg/appcache/application_1424333823218_21217/spark-local-20150221154811-998c/03/rdd_675_450 (No such file or directory) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.<init>(FileOutputStream.java:221) at java.io.FileOutputStream.<init>(FileOutputStream.java:171) at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:76) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:786) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:637) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:149) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:74) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:192) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) TIA, Anders
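A minimal sketch of enabling it from application code (the same property can also go into spark-defaults.conf; note that on YARN the external shuffle service additionally has to be set up as an auxiliary service on the node managers):

  import org.apache.spark.SparkConf
  import org.apache.spark.SparkContext

  // Enable the external shuffle service so shuffle files remain available
  // even if the executor that wrote them is lost.
  val conf = new SparkConf()
    .setAppName("shuffle-service-example")
    .set("spark.shuffle.service.enabled", "true")
  val sc = new SparkContext(conf)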
Re: Where can I find logs set inside RDD processing functions?
You can enable YARN log aggregation (yarn.log-aggregation-enable to true) and execute command yarn logs -applicationId your_application_id after your application finishes. Or you can look at them directly in HDFS in /tmp/logs/user/logs/applicationid/hostname On 6.2.2015. 19:50, nitinkak001 wrote: I am trying to debug my mapPartitionsFunction. Here is the code. There are two ways I am trying to log using log.info() or println(). I am running in yarn-cluster mode. While I can see the logs from driver code, I am not able to see logs from map, mapPartition functions in the Application Tracking URL. Where can I find the logs? /var outputRDD = partitionedRDD.mapPartitions(p = { val outputList = new ArrayList[scala.Tuple3[Long, Long, Int]] p.map({ case(key, value) = { log.info(Inside map) println(Inside map); for(i - 0 until outputTuples.size()){ val outputRecord = outputTuples.get(i) if(outputRecord != null){ outputList.add(outputRecord.getCurrRecordProfileID(), outputRecord.getWindowRecordProfileID, outputRecord.getScore()) } } } }) outputList.iterator() })/ Here is my log4j.properties /log4j.rootCategory=INFO, console log4j.appender.console=org.apache.log4j.ConsoleAppender log4j.appender.console.target=System.err log4j.appender.console.layout=org.apache.log4j.PatternLayout log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n # Settings to quiet third party logs that are too verbose log4j.logger.org.eclipse.jetty=WARN log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO/ -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Where-can-I-find-logs-set-inside-RDD-processing-functions-tp21537.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: LeaseExpiredException while writing schemardd to hdfs
Why don't you just map the RDD's rows to lines and then call saveAsTextFile()? On 3.2.2015. 11:15, Hafiz Mujadid wrote: I want to write a whole SchemaRDD to a single file in HDFS but am facing the following exception: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /test/data/data1.csv (inode 402042): File does not exist. Holder DFSClient_NONMAPREDUCE_-564238432_57 does not have any open files Here is my code: rdd.foreachPartition( iterator => { var output = new Path( outputpath ) val fs = FileSystem.get( new Configuration() ) var writer : BufferedWriter = null writer = new BufferedWriter( new OutputStreamWriter( fs.create( output ) ) ) var line = new StringBuilder iterator.foreach( row => { row.foreach( column => { line.append( column.toString + splitter ) } ) writer.write( line.toString.dropRight( 1 ) ) writer.newLine() line.clear } ) writer.close() } ) I think the problem is that I am creating a writer for each partition and multiple writers are executing in parallel, so when they try to write to the same file this problem appears. When I avoid this approach I face a Task not serializable exception. Any suggestion to handle this problem? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/LeaseExpiredException-while-writing-schemardd-to-hdfs-tp21477.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
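A minimal sketch of the suggested approach, reusing the rdd, splitter and outputpath values from the snippet above (assumed to be a SchemaRDD, a field delimiter string and an HDFS path, respectively):

  // Each Row becomes one delimited text line; Spark writes one part-file per partition,
  // so there is no contention on a single output file.
  rdd.map(row => row.mkString(splitter)).saveAsTextFile(outputpath)

If a single output file is really required, the RDD can be repartitioned with coalesce(1) before the write, at the cost of parallelism, or the part-files can be merged afterwards.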
Re: Discourse: A proposed alternative to the Spark User list
Ok, thanks for the clarifications. I didn't know this list has to remain as the only official list. Nabble is really not the best solution in the world, but we're stuck with it, I guess. That's it from me on this subject. Petar On 22.1.2015. 3:55, Nicholas Chammas wrote: I think a few things need to be laid out clearly: 1. This mailing list is the “official” user discussion platform. That is, it is sponsored and managed by the ASF. 2. Users are free to organize independent discussion platforms focusing on Spark, and there is already one such platform in Stack Overflow under the |apache-spark| and related tags. Stack Overflow works quite well. 3. The ASF will not agree to deprecating or migrating this user list to a platform that they do not control. 4. This mailing list has grown to an unwieldy size and discussions are hard to find or follow; discussion tooling is also lacking. We want to improve the utility and user experience of this mailing list. 5. We don’t want to fragment this “official” discussion community. 6. Nabble is an independent product not affiliated with the ASF. It offers a slightly better interface to the Apache mailing list archives. So to respond to some of your points, pzecevic: Apache user group could be frozen (not accepting new questions, if that’s possible) and redirect users to Stack Overflow (automatic reply?). From what I understand of the ASF’s policies, this is not possible. :( This mailing list must remain the official Spark user discussion platform. Other thing, about new Stack Exchange site I proposed earlier. If a new site is created, there is no problem with guidelines, I think, because Spark community can apply different guidelines for the new site. I think Stack Overflow and the various Spark tags are working fine. I don’t see a compelling need for a Stack Exchange dedicated to Spark, either now or in the near future. Also, I doubt a Spark-specific site can pass the 4 tests in the Area 51 FAQ http://area51.stackexchange.com/faq: * Almost all Spark questions are on-topic for Stack Overflow * Stack Overflow already exists, it already has a tag for Spark, and nobody is complaining * You’re not creating such a big group that you don’t have enough experts to answer all possible questions * There’s a high probability that users of Stack Overflow would enjoy seeing the occasional question about Spark I think complaining won’t be sufficient. :) Someone expressed a concern that they won’t allow creating a project-specific site, but there already exist some project-specific sites, like Tor, Drupal, Ubuntu… The communities for these projects are many, many times larger than the Spark community is or likely ever will be, simply due to the nature of the problems they are solving. What we need is an improvement to this mailing list. We need better tooling than Nabble to sit on top of the Apache archives, and we also need some way to control the volume and quality of mail on the list so that it remains a useful resource for the majority of users. Nick On Wed Jan 21 2015 at 3:13:21 PM pzecevic petar.zece...@gmail.com mailto:petar.zece...@gmail.com wrote: Hi, I tried to find the last reply by Nick Chammas (that I received in the digest) using the Nabble web interface, but I cannot find it (perhaps he didn't reply directly to the user list?). That's one example of Nabble's usability. Anyhow, I wanted to add my two cents... Apache user group could be frozen (not accepting new questions, if that's possible) and redirect users to Stack Overflow (automatic reply?). 
Old questions remain (and are searchable) on Nabble, new questions go to Stack Exchange, so no need for migration. That's the idea, at least, as I'm not sure if that's technically doable... Is it? dev mailing list could perhaps stay on Nabble (it's not that busy), or have a special tag on Stack Exchange. Other thing, about new Stack Exchange site I proposed earlier. If a new site is created, there is no problem with guidelines, I think, because Spark community can apply different guidelines for the new site. There is a FAQ about creating new sites: http://area51.stackexchange.com/faq It says: Stack Exchange sites are free to create and free to use. All we ask is that you have an enthusiastic, committed group of expert users who check in regularly, asking and answering questions. I think this requirement is satisfied... Someone expressed a concern that they won't allow creating a project-specific site, but there already exist some project-specific sites, like Tor, Drupal, Ubuntu... Later, though, the FAQ also says: If Y already exists, it already has a tag for X, and nobody is complaining (then you should not create a new
Re: Discourse: A proposed alternative to the Spark User list
But voting is done on dev list, right? That could stay there... Overlay might be a fine solution, too, but that still gives two user lists (SO and Nabble+overlay). On 22.1.2015. 10:42, Sean Owen wrote: Yes, there is some project business like votes of record on releases that needs to be carried on in standard, simple accessible place and SO is not at all suitable. Nobody is stuck with Nabble. The suggestion is to enable a different overlay on the existing list. SO remains a place you can ask questions too. So I agree with Nick's take. BTW are there perhaps plans to split this mailing list into subproject-specific lists? That might also help tune in/out the subset of conversations of interest. On Jan 22, 2015 10:30 AM, Petar Zecevic petar.zece...@gmail.com mailto:petar.zece...@gmail.com wrote: Ok, thanks for the clarifications. I didn't know this list has to remain as the only official list. Nabble is really not the best solution in the world, but we're stuck with it, I guess. That's it from me on this subject. Petar On 22.1.2015. 3:55, Nicholas Chammas wrote: I think a few things need to be laid out clearly: 1. This mailing list is the “official” user discussion platform. That is, it is sponsored and managed by the ASF. 2. Users are free to organize independent discussion platforms focusing on Spark, and there is already one such platform in Stack Overflow under the |apache-spark| and related tags. Stack Overflow works quite well. 3. The ASF will not agree to deprecating or migrating this user list to a platform that they do not control. 4. This mailing list has grown to an unwieldy size and discussions are hard to find or follow; discussion tooling is also lacking. We want to improve the utility and user experience of this mailing list. 5. We don’t want to fragment this “official” discussion community. 6. Nabble is an independent product not affiliated with the ASF. It offers a slightly better interface to the Apache mailing list archives. So to respond to some of your points, pzecevic: Apache user group could be frozen (not accepting new questions, if that’s possible) and redirect users to Stack Overflow (automatic reply?). From what I understand of the ASF’s policies, this is not possible. :( This mailing list must remain the official Spark user discussion platform. Other thing, about new Stack Exchange site I proposed earlier. If a new site is created, there is no problem with guidelines, I think, because Spark community can apply different guidelines for the new site. I think Stack Overflow and the various Spark tags are working fine. I don’t see a compelling need for a Stack Exchange dedicated to Spark, either now or in the near future. Also, I doubt a Spark-specific site can pass the 4 tests in the Area 51 FAQ http://area51.stackexchange.com/faq: * Almost all Spark questions are on-topic for Stack Overflow * Stack Overflow already exists, it already has a tag for Spark, and nobody is complaining * You’re not creating such a big group that you don’t have enough experts to answer all possible questions * There’s a high probability that users of Stack Overflow would enjoy seeing the occasional question about Spark I think complaining won’t be sufficient. 
:) Someone expressed a concern that they won’t allow creating a project-specific site, but there already exist some project-specific sites, like Tor, Drupal, Ubuntu… The communities for these projects are many, many times larger than the Spark community is or likely ever will be, simply due to the nature of the problems they are solving. What we need is an improvement to this mailing list. We need better tooling than Nabble to sit on top of the Apache archives, and we also need some way to control the volume and quality of mail on the list so that it remains a useful resource for the majority of users. Nick On Wed Jan 21 2015 at 3:13:21 PM pzecevic petar.zece...@gmail.com mailto:petar.zece...@gmail.com wrote: Hi, I tried to find the last reply by Nick Chammas (that I received in the digest) using the Nabble web interface, but I cannot find it (perhaps he didn't reply directly to the user list?). That's one example of Nabble's usability. Anyhow, I wanted to add my two cents... Apache user group could be frozen (not accepting new questions, if that's possible) and redirect users to Stack Overflow (automatic reply?). Old questions remain (and are searchable) on Nabble, new questions go to Stack Exchange, so no need