[jira] [Resolved] (SPARK-26826) Array indexing functions array_allpositions and array_select
[ https://issues.apache.org/jira/browse/SPARK-26826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petar Zecevic resolved SPARK-26826. --- Resolution: Won't Fix > Array indexing functions array_allpositions and array_select > > > Key: SPARK-26826 > URL: https://issues.apache.org/jira/browse/SPARK-26826 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 > Reporter: Petar Zecevic >Priority: Major > > This ticket proposes two extra array functions: {{array_allpositions}} (named > after {{array_position}}) and {{array_select}}. These functions should make > it easier to: > * get an array of indices of all occurrences of a value in an array > ({{array_allpositions}}) > * select all elements of an array based on an array of indices > ({{array_select}}) > Although higher-order functions, such as {{aggregate}} and {{transform}}, > have been recently added, performing the tasks above is still not simple, hence > this addition. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
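For context, the two proposed operations can be approximated today with the higher-order functions already available in Spark 2.4 SQL (filter, transform, sequence, element_at). The sketch below is only an illustration (the temporary view t and the looked-up value 5 are made up), but the relative verbosity is the gap the proposed functions were meant to close:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("array-indexing-workaround").getOrCreate()
    spark.sql("SELECT array(1, 5, 3, 5) AS arr").createOrReplaceTempView("t")

    // Approximation of the proposed array_allpositions(arr, 5):
    // 1-based indices of all occurrences of the value 5.
    spark.sql(
      "SELECT filter(sequence(1, size(arr)), i -> element_at(arr, i) = 5) AS positions FROM t"
    ).show() // [2, 4]

    // Approximation of the proposed array_select(arr, array(2, 4)):
    // the elements found at the given 1-based indices.
    spark.sql(
      "SELECT transform(array(2, 4), i -> element_at(arr, i)) AS selected FROM t"
    ).show() // [5, 5]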
[jira] [Updated] (SPARK-26826) Array indexing functions array_allpositions and array_select
[ https://issues.apache.org/jira/browse/SPARK-26826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petar Zecevic updated SPARK-26826: -- Description: This ticket proposes two extra array functions: {{array_allpositions}} (named after {{array_position}}) and {{array_select}}. These functions should make it easier to: * get an array of indices of all occurrences of a value in an array ({{array_allpositions}}) * select all elements of an array based on an array of indices ({{array_select}}) Although higher-order functions, such as {{aggregate}} and {{transform}}, have been recently added, performing the tasks above is still not simple, hence this addition. was: This ticket proposes two extra array functions: `array_allpositions` (named after `array_position`) and `array_select`. These functions should make it easier to: * get an array of indices of all occurrences of a value in an array (`array_allpositions`) * select all elements of an array based on an array of indices (`array_select`) Although higher-order functions, such as `aggregate` and `transform`, have been recently added, performing the tasks above is still not simple, hence this addition. > Array indexing functions array_allpositions and array_select > > > Key: SPARK-26826 > URL: https://issues.apache.org/jira/browse/SPARK-26826 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 > Reporter: Petar Zecevic >Priority: Major > > This ticket proposes two extra array functions: {{array_allpositions}} (named > after {{array_position}}) and {{array_select}}. These functions should make > it easier to: > * get an array of indices of all occurrences of a value in an array > ({{array_allpositions}}) > * select all elements of an array based on an array of indices > ({{array_select}}) > Although higher-order functions, such as {{aggregate}} and {{transform}}, > have been recently added, performing the tasks above is still not simple, hence > this addition. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26826) Array indexing functions array_allpositions and array_select
Petar Zecevic created SPARK-26826: - Summary: Array indexing functions array_allpositions and array_select Key: SPARK-26826 URL: https://issues.apache.org/jira/browse/SPARK-26826 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.0 Reporter: Petar Zecevic This ticket proposes two extra array functions: `array_allpositions` (named after `array_position`) and `array_select`. These functions should make it easier to: * get an array of indices of all occurrences of a value in an array (`array_allpositions`) * select all elements of an array based on an array of indices (`array_select`) Although higher-order functions, such as `aggregate` and `transform`, have been recently added, performing the tasks above is still not simple, hence this addition. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: Jenkins build errors
The problem was with the changes upstream. Fetching upstream and rebasing resolved it, and now the build is passing. I also added a design doc and made the JIRA description a bit clearer (https://issues.apache.org/jira/browse/SPARK-24020) so I hope it will get merged soon. Thanks, Petar Sean Owen @ 1970-01-01 01:00 CET: > Also confused about this one as many builds succeed. One possible difference > is that this failure is in the Hive tests, so are you building and testing > with -Phive locally where it works? still does not explain the download > failure. It could be a mirror > problem, throttling, etc. But there again haven't spotted another failing > Hive test. > > On Wed, Jun 20, 2018 at 1:55 AM Petar Zecevic wrote: > > It's still dying. Back to this error (it used to be spark-2.2.0 before): > > java.io.IOException: Cannot run program "./bin/spark-submit" (in directory > "/tmp/test-spark/spark-2.1.2"): error=2, No such file or directory > So, a mirror is missing that Spark version... I don't understand why nobody > else has these errors and I get them every time without fail. > > Petar - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
[jira] [Updated] (SPARK-24020) Sort-merge join inner range optimization
[ https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petar Zecevic updated SPARK-24020: -- Description: The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted within partitions by Y column and you need to calculate an expensive function on the joined rows, which reduces the number of output rows (e.g. condition based on a spatial distance calculation). But you could theoretically reduce the number of joined rows for which the calculation itself is performed by using a range condition on the Y column. Something like this: {{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }} However, during a sort-merge join with this range condition specified, Spark will first cross-join all the rows with the same X value and only then try to apply the range condition and any function calculations. This happens because, inside the generated sort-merge join (SMJ) code, these extra conditions are put in the same block with the function being calculated and there is no way to evaluate these conditions before reading all the rows to be checked into memory (into an {{ExternalAppendOnlyUnsafeRowArray}}). If the two tables have a large number of rows per X, this can result in a huge number of calculations and a huge number of rows in executor memory, which can be unfeasible. h3. The solution implementation We therefore propose a change to the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window through the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join; in the literature it is sometimes called an “epsilon join”. We call it a "sort-merge inner range join". This design uses much less memory (not all rows with the same values of X need to be loaded into memory at once) and requires a much lower number of comparisons (the validity of this statement depends on the actual data and conditions used). h3. The classes that need to be changed For implementing the described change we propose changes to these classes: * _ExtractEquiJoinKeys_ – a pattern that needs to be extended to be able to recognize the case where a simple range condition with lower and upper limits is used on a secondary column (a column not included in the equi-join condition). The pattern also needs to extract the information later required for code generation etc. * _InMemoryUnsafeRowQueue_ – the moving window implementation to be used instead of the _ExternalAppendOnlyUnsafeRowArray_ class. The rows need to be added to and removed from the structure as the left key (X) changes, or the left secondary value (Y) changes, so the structure needs to be a queue. To make the change as unintrusive as possible, we propose to implement _InMemoryUnsafeRowQueue_ as a subclass of _ExternalAppendOnlyUnsafeRowArray_ * _JoinSelection_ – a strategy that uses _ExtractEquiJoinKeys_ and needs to be aware of the extracted range conditions * _SortMergeJoinExec_ – the main implementation of the optimization. 
Needs to support two code paths: ** when whole-stage code generation is turned off (method doExecute, which uses sortMergeJoinInnerRangeScanner) ** when whole-stage code generation is turned on (methods doProduce and genScanner) * _SortMergeJoinInnerRangeScanner_ – implements the SMJ with inner-range optimization in the case when whole-stage codegen is turned off * _InnerJoinSuite_ – functional tests * _JoinBenchmark_ – performance tests h3. Triggering the optimization The optimization should be triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts should be added. To limit the impact of this change we also propose adding a new parameter (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which could be used to switch off the optimization entirely. h2. Applicable use cases Potential use-cases for this are joins based on spatial or temporal distance calculations. was: The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted within partitions by Y column and you need to calculate an expensive function on the joined rows, which reduces the number of output rows (e.g. condition based on a spatial distance calculation). But you could theoretically reduce the number of joined rows for which the calculation itself is performed by using a range condition on the Y column. Something like this: {{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }} However, during a sort-merge join with this range
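To make the moving-window idea concrete, here is a greatly simplified, hedged Scala sketch of the merge loop. It is not the actual SortMergeJoinExec / InMemoryUnsafeRowQueue code; the Row case class, the driver loop and the tolerance d are illustrative assumptions. Both inputs are assumed sorted by (x, y), and the queue plays the role of the proposed InMemoryUnsafeRowQueue:

    import scala.collection.mutable

    case class Row(x: Int, y: Double)

    // Joins rows with equal x whose y values differ by at most d, in a single pass
    // over both (x, y)-sorted inputs, keeping only the current window in memory.
    def innerRangeJoin(left: Seq[Row], right: Seq[Row], d: Double): Seq[(Row, Row)] = {
      val out = mutable.ArrayBuffer.empty[(Row, Row)]
      val window = mutable.Queue.empty[Row] // stands in for InMemoryUnsafeRowQueue
      var ri = 0
      for (l <- left) {
        // Evict right rows that can never match this or any later left row:
        // a different key, or y below the lower bound l.y - d.
        while (window.nonEmpty && (window.head.x != l.x || window.head.y < l.y - d))
          window.dequeue()
        // Advance the right side up to the upper bound l.y + d, enqueueing matches.
        while (ri < right.length &&
               (right(ri).x < l.x || (right(ri).x == l.x && right(ri).y <= l.y + d))) {
          if (right(ri).x == l.x && right(ri).y >= l.y - d) window.enqueue(right(ri))
          ri += 1
        }
        // Every row left in the window lies inside [l.y - d, l.y + d] for the same x.
        window.foreach(r => out += ((l, r)))
      }
      out
    }

    // Example: only pairs sharing the key and within a y-distance of 1.0 are produced.
    innerRangeJoin(
      left  = Seq(Row(1, 1.0), Row(1, 5.0), Row(2, 2.0)),
      right = Seq(Row(1, 0.5), Row(1, 4.0), Row(2, 9.0)),
      d = 1.0
    ) // -> Seq((Row(1,1.0), Row(1,0.5)), (Row(1,5.0), Row(1,4.0)))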
[jira] [Updated] (SPARK-24020) Sort-merge join inner range optimization
[ https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petar Zecevic updated SPARK-24020: -- Description: The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted within partitions by Y column and you need to calculate an expensive function on the joined rows, which reduces the number of output rows (e.g. condition based on a spatial distance calculation). But you could theoretically reduce the number of joined rows for which the calculation itself is performed by using a range condition on the Y column. Something like this: {{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }} However, during a sort-merge join with this range condition specified, Spark will first cross-join all the rows with the same X value and only then try to apply the range condition and any function calculations. This happens because, inside the generated sort-merge join (SMJ) code, these extra conditions are put in the same block with the function being calculated and there is no way to evaluate these conditions before reading all the rows to be checked into memory (into an {{ExternalAppendOnlyUnsafeRowArray}}). If the two tables have a large number of rows per X, this can result in a huge number of calculations and a huge number of rows in executor memory, which can be unfeasible. h3. The solution implementation We therefore propose a change to the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window through the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join; in the literature it is sometimes called an “epsilon join”. We call it a "sort-merge inner range join". This design uses much less memory (not all rows with the same values of X need to be loaded into memory at once) and requires a much lower number of comparisons (the validity of this statement depends on the actual data and conditions used). h3. The classes that need to be changed For implementing the described change we propose changes to these classes: * _ExtractEquiJoinKeys_ – a pattern that needs to be extended to be able to recognize the case where a simple range condition with lower and upper limits is used on a secondary column (a column not included in the equi-join condition). The pattern also needs to extract the information later required for code generation etc. * _InMemoryUnsafeRowQueue_ – the moving window implementation to be used instead of the _ExternalAppendOnlyUnsafeRowArray_ class. The rows need to be added to and removed from the structure as the left key (X) changes, or the left secondary value (Y) changes, so the structure needs to be a queue. To make the change as unintrusive as possible, we propose to implement _InMemoryUnsafeRowQueue_ as a subclass of _ExternalAppendOnlyUnsafeRowArray_ * _JoinSelection_ – a strategy that uses _ExtractEquiJoinKeys_ and needs to be aware of the extracted range conditions * _SortMergeJoinExec_ – the main implementation of the optimization. 
Needs to support two code paths: ** when whole-stage code generation is turned off (method doExecute, which uses sortMergeJoinInnerRangeScanner) ** when whole-stage code generation is turned on (methods doProduce and genScanner) * _SortMergeJoinInnerRangeScanner_ – implements the SMJ with inner-range optimization in the case when whole-stage codegen is turned off * _InnerJoinSuite_ – functional tests * _JoinBenchmark_ – performance tests h3. Triggering the optimization The optimization should be triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts should be added. To limit the impact of this change we also propose adding a new parameter (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which could be used to switch off the optimization entirely. h3. Applicable use cases Potential use-cases for this are joins based on spatial or temporal distance calculations. was: The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted within partitions by Y column and you need to calculate an expensive function on the joined rows, which reduces the number of output rows (e.g. condition based on a spatial distance calculation). But you could theoretically reduce the number of joined rows for which the calculation itself is performed by using a range condition on the Y column. Something like this: {{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }} However, during a sort-merge join with this range
[jira] [Updated] (SPARK-24020) Sort-merge join inner range optimization
[ https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petar Zecevic updated SPARK-24020: -- Description: The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted within partitions by Y column and you need to calculate an expensive function on the joined rows, which reduces the number of output rows (e.g. condition based on a spatial distance calculation). But you could theoretically reduce the number of joined rows for which the calculation itself is performed by using a range condition on the Y column. Something like this: {{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }} However, during a sort-merge join with this range condition specified, Spark will first cross-join all the rows with the same X value and only then try to apply the range condition and any function calculations. This happens because, inside the generated sort-merge join (SMJ) code, these extra conditions are put in the same block with the function being calculated and there is no way to evaluate these conditions before reading all the rows to be checked into memory (into an {{ExternalAppendOnlyUnsafeRowArray}}). If the two tables have a large number of rows per X, this can result in a huge number of calculations and a huge number of rows in executor memory, which can be unfeasible. h2. The solution implementation We therefore propose a change to the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window through the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join; in the literature it is sometimes called an “epsilon join”. We call it a "sort-merge inner range join". This design uses much less memory (not all rows with the same values of X need to be loaded into memory at once) and requires a much lower number of comparisons (the validity of this statement depends on the actual data and conditions used). h3. The classes that need to be changed For implementing the described change we propose changes to these classes: * _ExtractEquiJoinKeys_ – a pattern that needs to be extended to be able to recognize the case where a simple range condition with lower and upper limits is used on a secondary column (a column not included in the equi-join condition). The pattern also needs to extract the information later required for code generation etc. * _InMemoryUnsafeRowQueue_ – the moving window implementation to be used instead of the _ExternalAppendOnlyUnsafeRowArray_ class. The rows need to be added to and removed from the structure as the left key (X) changes, or the left secondary value (Y) changes, so the structure needs to be a queue. To make the change as unintrusive as possible, we propose to implement _InMemoryUnsafeRowQueue_ as a subclass of _ExternalAppendOnlyUnsafeRowArray_ * _JoinSelection_ – a strategy that uses _ExtractEquiJoinKeys_ and needs to be aware of the extracted range conditions * _SortMergeJoinExec_ – the main implementation of the optimization. 
Needs to support two code paths: ** when whole-stage code generation is turned off (method doExecute, which uses sortMergeJoinInnerRangeScanner) ** when whole-stage code generation is turned on (methods doProduce and genScanner) * _SortMergeJoinInnerRangeScanner_ – implements the SMJ with inner-range optimization in the case when whole-stage codegen is turned off * _InnerJoinSuite_ – functional tests * _JoinBenchmark_ – performance tests h2. Triggering the optimization The optimization should be triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts should be added. To limit the impact of this change we also propose adding a new parameter (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which could be used to switch off the optimization entirely. h2. Applicable use cases Potential use-cases for this are joins based on spatial or temporal distance calculations. was: The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted within partitions by Y column and you need to calculate an expensive function on the joined rows, which reduces the number of output rows (e.g. condition based on a spatial distance calculation). But you could theoretically reduce the number of joined rows for which the calculation itself is performed by using a range condition on the Y column. Something like this: {{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }} However, during a sort-merge join with this range
[jira] [Updated] (SPARK-24020) Sort-merge join inner range optimization
[ https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petar Zecevic updated SPARK-24020: -- Description: The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted within partitions by Y column and you need to calculate an expensive function on the joined rows, which reduces the number of output rows (e.g. condition based on a spatial distance calculation). But you could theoretically reduce the number of joined rows for which the calculation itself is performed by using a range condition on the Y column. Something like this: {{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }} However, during a sort-merge join with this range condition specified, Spark will first cross-join all the rows with the same X value and only then try to apply the range condition and any function calculations. This happens because, inside the generated sort-merge join (SMJ) code, these extra conditions are put in the same block with the function being calculated and there is no way to evaluate these conditions before reading all the rows to be checked into memory (into an {{ExternalAppendOnlyUnsafeRowArray}}). If the two tables have a large number of rows per X, this can result in a huge number of calculations and a huge number of rows in executor memory, which can be unfeasible. h2. The solution implementation We therefore propose a change to the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window through the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join; in the literature it is sometimes called an “epsilon join”. We call it a "sort-merge inner range join". This design uses much less memory (not all rows with the same values of X need to be loaded into memory at once) and requires a much lower number of comparisons (the validity of this statement depends on the actual data and conditions used). h3. For implementing the described change we propose changes to these classes: * _ExtractEquiJoinKeys_ – a pattern that needs to be extended to be able to recognize the case where a simple range condition with lower and upper limits is used on a secondary column (a column not included in the equi-join condition). The pattern also needs to extract the information later required for code generation etc. * _InMemoryUnsafeRowQueue_ – the moving window implementation to be used instead of the _ExternalAppendOnlyUnsafeRowArray_ class. The rows need to be added to and removed from the structure as the left key (X) changes, or the left secondary value (Y) changes, so the structure needs to be a queue. To make the change as unintrusive as possible, we propose to implement _InMemoryUnsafeRowQueue_ as a subclass of _ExternalAppendOnlyUnsafeRowArray_ * _JoinSelection_ – a strategy that uses _ExtractEquiJoinKeys_ and needs to be aware of the extracted range conditions * _SortMergeJoinExec_ – the main implementation of the optimization. 
Needs to support two code paths: ** when whole-stage code generation is turned off (method doExecute, which uses sortMergeJoinInnerRangeScanner) ** when whole-stage code generation is turned on (methods doProduce and genScanner) * _SortMergeJoinInnerRangeScanner_ – implements the SMJ with inner-range optimization in the case when whole-stage codegen is turned off * _InnerJoinSuite_ – functional tests * _JoinBenchmark_ – performance tests The optimization should be triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts should be added. To limit the impact of this change we also propose adding a new parameter (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which could be used to switch off the optimization entirely. Potential use-cases for this are joins based on spatial or temporal distance calculations. was: The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted by Y column (within partitions) and you need to calculate an expensive function on the joined rows. During a sort-merge join, Spark will do cross-joins of all rows that have the same X values and calculate the function's value on all of them. If the two tables have a large number of rows per X, this can result in a huge number of calculations. We hereby propose an optimization that would allow you to reduce the number of matching rows per X using a range condition on Y columns of the two tables. Something like: ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d The
[jira] [Updated] (SPARK-24020) Sort-merge join inner range optimization
[ https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petar Zecevic updated SPARK-24020: -- Attachment: SMJ-innerRange-PR24020-designDoc.pdf > Sort-merge join inner range optimization > > > Key: SPARK-24020 > URL: https://issues.apache.org/jira/browse/SPARK-24020 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 > Reporter: Petar Zecevic >Priority: Major > Attachments: SMJ-innerRange-PR24020-designDoc.pdf > > > The problem we are solving is the case where you have two big tables > partitioned by X column, but also sorted by Y column (within partitions) and > you need to calculate an expensive function on the joined rows. During a > sort-merge join, Spark will do cross-joins of all rows that have the same X > values and calculate the function's value on all of them. If the two tables > have a large number of rows per X, this can result in a huge number of > calculations. > We hereby propose an optimization that would allow you to reduce the number > of matching rows per X using a range condition on Y columns of the two > tables. Something like: > ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d > The way SMJ is currently implemented, these extra conditions have no > influence on the number of rows (per X) being checked because these extra > conditions are put in the same block with the function being calculated. > Here we propose a change to the sort-merge join so that, when these extra > conditions are specified, a queue is used instead of the > ExternalAppendOnlyUnsafeRowArray class. This queue would then be used as a > moving window across the values from the right relation as the left row > changes. You could call this a combination of an equi-join and a theta join > (we call it "sort-merge inner range join"). > Potential use-cases for this are joins based on spatial or temporal distance > calculations. > The optimization should be triggered automatically when an equi-join > expression is present AND lower and upper range conditions on a secondary > column are specified. If the tables aren't sorted by both columns, > appropriate sorts should be added. > To limit the impact of this change we also propose adding a new parameter > (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which > could be used to switch off the optimization entirely. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
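As a usage sketch, the query shape the ticket targets looks like the following. Note that the configuration key is only the tentative name proposed above (it does not exist in any released Spark version), and the tables, columns and the cost_fn UDF are hypothetical placeholders:

    // Tentative switch proposed in this ticket; NOT an existing Spark option.
    spark.conf.set("spark.sql.join.smj.useInnerRangeOptimization", "true")

    // Hypothetical temporal-distance join: device_id is the equi-join key (X),
    // ts carries the secondary range condition (Y), and cost_fn is an expensive UDF
    // that would otherwise be evaluated for every pair of rows sharing device_id.
    val joined = spark.sql("""
      SELECT t1.id, t2.id, cost_fn(t1.payload, t2.payload) AS score
      FROM   events1 t1
      JOIN   events2 t2
        ON   t1.device_id = t2.device_id
       AND   t1.ts BETWEEN t2.ts - 30 AND t2.ts + 30
    """)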
Re: Jenkins build errors
It's still dying. Back to this error (it used to be spark-2.2.0 before): java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.1.2"): error=2, No such file or directory So, a mirror is missing that Spark version... I don't understand why nobody else has these errors and I get them every time without fail. Petar Le 6/19/2018 à 2:35 PM, Sean Owen a écrit : Those still appear to be env problems. I don't know why it is so persistent. Does it all pass locally? Retrigger tests again and see what happens. On Tue, Jun 19, 2018, 2:53 AM Petar Zecevic <mailto:petar.zece...@gmail.com>> wrote: Thanks, but unfortunately, it died again. Now at pyspark tests: Running PySpark tests Running PySpark tests. Output is in /home/jenkins/workspace/SparkPullRequestBuilder@2/python/unit-tests.log Will test against the following Python executables: ['python2.7', 'python3.4', 'pypy'] Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming'] Will skip PyArrow related features against Python executable 'python2.7' in 'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not found. Will skip Pandas related features against Python executable 'python2.7' in 'pyspark-sql' module. Pandas >= 0.19.2 is required; however, Pandas 0.16.0 was found. Will test PyArrow related features against Python executable 'python3.4' in 'pyspark-sql' module. Will test Pandas related features against Python executable 'python3.4' in 'pyspark-sql' module. Will skip PyArrow related features against Python executable 'pypy' in 'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not found. Will skip Pandas related features against Python executable 'pypy' in 'pyspark-sql' module. Pandas >= 0.19.2 is required; however, Pandas was not found. Starting test(python2.7): pyspark.mllib.tests Starting test(pypy): pyspark.sql.tests Starting test(pypy): pyspark.streaming.tests Starting test(pypy): pyspark.tests Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). ... [Stage 0:> (0 + 1) / 1] .. [Stage 0:> (0 + 4) / 4] . [Stage 0:> (0 + 4) / 4] .. [Stage 0:> (0 + 4) / 4] [Stage 0:> (0 + 4) / 4] [Stage 0:> (0 + 4) / 4] [Stage 0:> (0 + 4) / 4] [Stage 0:>(0 + 32) / 32]... [Stage 10:> (0 + 1) / 1] . [Stage 0:> (0 + 4) / 4] .s [Stage 0:> (0 + 1) / 1] . [Stage 0:> (0 + 4) / 4] [Stage 0:==>(1 + 3) / 4] . [Stage 0:> (0 + 4) / 4] .. [Stage 0:> (0 + 2) / 2]
Re: Jenkins build errors
/0.1/mylib-0.1.pom -- artifact a#mylib;0.1!mylib.jar: https://repo1.maven.org/maven2/a/mylib/0.1/mylib-0.1.jar spark-packages: tried http://dl.bintray.com/spark-packages/maven/a/mylib/0.1/mylib-0.1.pom -- artifact a#mylib;0.1!mylib.jar: http://dl.bintray.com/spark-packages/maven/a/mylib/0.1/mylib-0.1.jar repo-1: tried file:/tmp/tmpgO7AIY/a/mylib/0.1/mylib-0.1.pom :: :: UNRESOLVED DEPENDENCIES :: :: :: a#mylib;0.1: not found :: :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: a#mylib;0.1: not found] at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1268) at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:49) at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:348) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:170) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Ffile:/tmp/tmpwtN2z_ added as a remote repository with the name: repo-1 Ivy Default Cache set to: /home/jenkins/.ivy2/cache The jars for the packages stored in: /home/jenkins/.ivy2/jars :: loading settings :: url = jar:file:/home/jenkins/workspace/SparkPullRequestBuilder@2/assembly/target/scala-2.11/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml a#mylib added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default] found a#mylib;0.1 in repo-1 :: resolution report :: resolve 1378ms :: artifacts dl 4ms :: modules in use: a#mylib;0.1 from repo-1 in [default] - | |modules|| artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| - | default | 1 | 1 | 1 | 0 || 1 | 0 | - :: retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0 artifacts copied, 1 already retrieved (0kB/8ms) . [Stage 0:> (0 + 4) / 4] . [Stage 0:> (0 + 1) / 1] . [Stage 0:> (0 + 1) / 1] ... [Stage 0:> (0 + 4) / 20] [Stage 0:=>(6 + 4) / 20] .. == FAIL: test_package_dependency (pyspark.tests.SparkSubmitTests) Submit and test a script with a dependency on a Spark Package -- Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", line 2093, in test_package_dependency self.assertEqual(0, proc.returncode) AssertionError: 0 != 1 -- Ran 127 tests in 205.547s FAILED (failures=1, skipped=2) NOTE: Skipping SciPy tests as it does not seem to be installed NOTE: Skipping NumPy tests as it does not seem to be installed Random listing order was used Had test failures in pyspark.tests with pypy; see logs. [error] running /home/jenkins/workspace/SparkPullRequestBuilder@2/python/run-tests --parallelism=4 ; received return code 255 Attempting to post to Github... > Post successful. Build step 'Execute shell' marked build as failure Archiving artifacts Recording test results Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92038/ Test FAILed. Finished: FAILURE Le 6/18/2018 à 8:05 PM, shane knapp a écrit : i triggered another build against your PR, so let's see if this happens again or was a transient failure. https://amplab.cs.berkeley.edu/jenkins/jo
Jenkins build errors
Hi, Jenkins build for my PR (https://github.com/apache/spark/pull/21109 ; https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92023/testReport/org.apache.spark.sql.hive/HiveExternalCatalogVersionsSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/) keeps failing. First it couldn't download Spark v.2.2.0 (indeed, it wasn't available at the mirror it selected), now it's failing with this exception below. Can someone explain these errors for me? Is anybody else experiencing similar problems? Thanks, Petar Error Message java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.2.1"): error=2, No such file or directory Stacktrace sbt.ForkMain$ForkError: java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.2.1"): error=2, No such file or directory at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) at org.apache.spark.sql.hive.SparkSubmitTestUtils$class.runSparkSubmit(SparkSubmitTestUtils.scala:73) at org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.runSparkSubmit(HiveExternalCatalogVersionsSuite.scala:43) at org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:176) at org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:161) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.beforeAll(HiveExternalCatalogVersionsSuite.scala:161) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) at sbt.ForkMain$Run$2.call(ForkMain.java:296) at sbt.ForkMain$Run$2.call(ForkMain.java:286) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: sbt.ForkMain$ForkError: java.io.IOException: error=2, No such file or directory at java.lang.UNIXProcess.forkAndExec(Native Method) at java.lang.UNIXProcess.(UNIXProcess.java:248) at java.lang.ProcessImpl.start(ProcessImpl.java:134) at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) ... 17 more
Re: Sort-merge join improvement
Hi, we went through a round of reviews on this PR. Performance improvements can be substantial and there are unit and performance tests included. One remark was that the amount of changed code is large but I don't see how to reduce it and still keep the performance improvements. Besides, all the new code is well contained in separate classes (unless it was necessary to change existing ones). So I believe this is ready to be merged. Can some of the committers please take another look at this and accept the PR? Thank you, Petar Zecevic Le 5/15/2018 à 10:55 AM, Petar Zecevic a écrit : Based on some reviews I put additional effort into fixing the case when wholestage codegen is turned off. Sort-merge join with additional range conditions is now 10x faster (can be more or less, depending on exact use-case) in both cases - with wholestage turned off or on - compared to non-optimized SMJ. Merging this would help us tremendously and I believe this can be useful in other applications, too. Can you please review (https://github.com/apache/spark/pull/21109) and merge the patch? Thank you, Petar Zecevic Le 4/23/2018 à 6:28 PM, Petar Zecevic a écrit : Hi, the PR tests completed successfully (https://github.com/apache/spark/pull/21109). Can you please review the patch and merge it upstream if you think it's OK? Thanks, Petar Le 4/18/2018 à 4:52 PM, Petar Zecevic a écrit : As instructed offline, I opened a JIRA for this: https://issues.apache.org/jira/browse/SPARK-24020 I will create a pull request soon. Le 4/17/2018 à 6:21 PM, Petar Zecevic a écrit : Hello everybody We (at University of Zagreb and University of Washington) have implemented an optimization of Spark's sort-merge join (SMJ) which has improved performance of our jobs considerably and we would like to know if Spark community thinks it would be useful to include this in the main distribution. The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted by Y column (within partitions) and you need to calculate an expensive function on the joined rows. During a sort-merge join, Spark will do cross-joins of all rows that have the same X values and calculate the function's value on all of them. If the two tables have a large number of rows per X, this can result in a huge number of calculations. Our optimization allows you to reduce the number of matching rows per X using a range condition on Y columns of the two tables. Something like: ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d The way SMJ is currently implemented, these extra conditions have no influence on the number of rows (per X) being checked because these extra conditions are put in the same block with the function being calculated. Our optimization changes the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window across the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join (we call it "sort-merge inner range join"). Potential use-cases for this are joins based on spatial or temporal distance calculations. The optimization is triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts will be added. We have several questions: 1. 
Do you see any other way to optimize queries like these (eliminate unnecessary calculations) without changing the sort-merge join algorithm? 2. We believe there is a more general pattern here and that this could help in other similar situations where secondary sorting is available. Would you agree? 3. Would you like us to open a JIRA ticket and create a pull request? Thanks, Petar Zecevic - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: Sort-merge join improvement
Based on some reviews I put additional effort into fixing the case when wholestage codegen is turned off. Sort-merge join with additional range conditions is now 10x faster (can be more or less, depending on exact use-case) in both cases - with wholestage turned off or on - compared to non-optimized SMJ. Merging this would help us tremendously and I believe this can be useful in other applications, too. Can you please review (https://github.com/apache/spark/pull/21109) and merge the patch? Thank you, Petar Zecevic Le 4/23/2018 à 6:28 PM, Petar Zecevic a écrit : Hi, the PR tests completed successfully (https://github.com/apache/spark/pull/21109). Can you please review the patch and merge it upstream if you think it's OK? Thanks, Petar Le 4/18/2018 à 4:52 PM, Petar Zecevic a écrit : As instructed offline, I opened a JIRA for this: https://issues.apache.org/jira/browse/SPARK-24020 I will create a pull request soon. Le 4/17/2018 à 6:21 PM, Petar Zecevic a écrit : Hello everybody We (at University of Zagreb and University of Washington) have implemented an optimization of Spark's sort-merge join (SMJ) which has improved performance of our jobs considerably and we would like to know if Spark community thinks it would be useful to include this in the main distribution. The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted by Y column (within partitions) and you need to calculate an expensive function on the joined rows. During a sort-merge join, Spark will do cross-joins of all rows that have the same X values and calculate the function's value on all of them. If the two tables have a large number of rows per X, this can result in a huge number of calculations. Our optimization allows you to reduce the number of matching rows per X using a range condition on Y columns of the two tables. Something like: ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d The way SMJ is currently implemented, these extra conditions have no influence on the number of rows (per X) being checked because these extra conditions are put in the same block with the function being calculated. Our optimization changes the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window across the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join (we call it "sort-merge inner range join"). Potential use-cases for this are joins based on spatial or temporal distance calculations. The optimization is triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts will be added. We have several questions: 1. Do you see any other way to optimize queries like these (eliminate unnecessary calculations) without changing the sort-merge join algorithm? 2. We believe there is a more general pattern here and that this could help in other similar situations where secondary sorting is available. Would you agree? 3. Would you like us to open a JIRA ticket and create a pull request? Thanks, Petar Zecevic - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: Sort-merge join improvement
Hi, the PR tests completed successfully (https://github.com/apache/spark/pull/21109). Can you please review the patch and merge it upstream if you think it's OK? Thanks, Petar Le 4/18/2018 à 4:52 PM, Petar Zecevic a écrit : As instructed offline, I opened a JIRA for this: https://issues.apache.org/jira/browse/SPARK-24020 I will create a pull request soon. Le 4/17/2018 à 6:21 PM, Petar Zecevic a écrit : Hello everybody We (at University of Zagreb and University of Washington) have implemented an optimization of Spark's sort-merge join (SMJ) which has improved performance of our jobs considerably and we would like to know if Spark community thinks it would be useful to include this in the main distribution. The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted by Y column (within partitions) and you need to calculate an expensive function on the joined rows. During a sort-merge join, Spark will do cross-joins of all rows that have the same X values and calculate the function's value on all of them. If the two tables have a large number of rows per X, this can result in a huge number of calculations. Our optimization allows you to reduce the number of matching rows per X using a range condition on Y columns of the two tables. Something like: ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d The way SMJ is currently implemented, these extra conditions have no influence on the number of rows (per X) being checked because these extra conditions are put in the same block with the function being calculated. Our optimization changes the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window across the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join (we call it "sort-merge inner range join"). Potential use-cases for this are joins based on spatial or temporal distance calculations. The optimization is triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts will be added. We have several questions: 1. Do you see any other way to optimize queries like these (eliminate unnecessary calculations) without changing the sort-merge join algorithm? 2. We believe there is a more general pattern here and that this could help in other similar situations where secondary sorting is available. Would you agree? 3. Would you like us to open a JIRA ticket and create a pull request? Thanks, Petar Zecevic - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
[jira] [Commented] (SPARK-24020) Sort-merge join inner range optimization
[ https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444335#comment-16444335 ] Petar Zecevic commented on SPARK-24020: --- No, this implementation only applies to equi-joins that have range conditions on different columns. You can think of it as an equi-join with "sub-band" conditions. Hence the name we gave it ("sort-merge inner range join"). > Sort-merge join inner range optimization > > > Key: SPARK-24020 > URL: https://issues.apache.org/jira/browse/SPARK-24020 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Petar Zecevic >Priority: Major > > The problem we are solving is the case where you have two big tables > partitioned by X column, but also sorted by Y column (within partitions) and > you need to calculate an expensive function on the joined rows. During a > sort-merge join, Spark will do cross-joins of all rows that have the same X > values and calculate the function's value on all of them. If the two tables > have a large number of rows per X, this can result in a huge number of > calculations. > We hereby propose an optimization that would allow you to reduce the number > of matching rows per X using a range condition on Y columns of the two > tables. Something like: > ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d > The way SMJ is currently implemented, these extra conditions have no > influence on the number of rows (per X) being checked because these extra > conditions are put in the same block with the function being calculated. > Here we propose a change to the sort-merge join so that, when these extra > conditions are specified, a queue is used instead of the > ExternalAppendOnlyUnsafeRowArray class. This queue would then be used as a > moving window across the values from the right relation as the left row > changes. You could call this a combination of an equi-join and a theta join > (we call it "sort-merge inner range join"). > Potential use-cases for this are joins based on spatial or temporal distance > calculations. > The optimization should be triggered automatically when an equi-join > expression is present AND lower and upper range conditions on a secondary > column are specified. If the tables aren't sorted by both columns, > appropriate sorts should be added. > To limit the impact of this change we also propose adding a new parameter > (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which > could be used to switch off the optimization entirely. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
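To illustrate the distinction drawn in this comment (tables t1 and t2 are hypothetical): only the second query shape, an equi-join key plus a range condition on a different, secondary column, is what the optimization targets.

    // A plain band (theta) join on Y alone is NOT covered: there is no equi-join key.
    spark.sql("SELECT * FROM t1 JOIN t2 ON t1.Y BETWEEN t2.Y - 5 AND t2.Y + 5")

    // An equi-join on X plus a "sub-band" range condition on Y is the targeted case.
    spark.sql("SELECT * FROM t1 JOIN t2 ON t1.X = t2.X AND t1.Y BETWEEN t2.Y - 5 AND t2.Y + 5")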
Re: Sort-merge join improvement
As instructed offline, I opened a JIRA for this: https://issues.apache.org/jira/browse/SPARK-24020 I will create a pull request soon. Le 4/17/2018 à 6:21 PM, Petar Zecevic a écrit : Hello everybody We (at University of Zagreb and University of Washington) have implemented an optimization of Spark's sort-merge join (SMJ) which has improved performance of our jobs considerably and we would like to know if Spark community thinks it would be useful to include this in the main distribution. The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted by Y column (within partitions) and you need to calculate an expensive function on the joined rows. During a sort-merge join, Spark will do cross-joins of all rows that have the same X values and calculate the function's value on all of them. If the two tables have a large number of rows per X, this can result in a huge number of calculations. Our optimization allows you to reduce the number of matching rows per X using a range condition on Y columns of the two tables. Something like: ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d The way SMJ is currently implemented, these extra conditions have no influence on the number of rows (per X) being checked because these extra conditions are put in the same block with the function being calculated. Our optimization changes the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window across the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join (we call it "sort-merge inner range join"). Potential use-cases for this are joins based on spatial or temporal distance calculations. The optimization is triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts will be added. We have several questions: 1. Do you see any other way to optimize queries like these (eliminate unnecessary calculations) without changing the sort-merge join algorithm? 2. We believe there is a more general pattern here and that this could help in other similar situations where secondary sorting is available. Would you agree? 3. Would you like us to open a JIRA ticket and create a pull request? Thanks, Petar Zecevic - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
[jira] [Created] (SPARK-24020) Sort-merge join inner range optimization
Petar Zecevic created SPARK-24020: - Summary: Sort-merge join inner range optimization Key: SPARK-24020 URL: https://issues.apache.org/jira/browse/SPARK-24020 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Petar Zecevic The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted by Y column (within partitions) and you need to calculate an expensive function on the joined rows. During a sort-merge join, Spark will do cross-joins of all rows that have the same X values and calculate the function's value on all of them. If the two tables have a large number of rows per X, this can result in a huge number of calculations. We hereby propose an optimization that would allow you to reduce the number of matching rows per X using a range condition on Y columns of the two tables. Something like: ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d The way SMJ is currently implemented, these extra conditions have no influence on the number of rows (per X) being checked because these extra conditions are put in the same block with the function being calculated. Here we propose a change to the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue would then be used as a moving window across the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join (we call it "sort-merge inner range join"). Potential use-cases for this are joins based on spatial or temporal distance calculations. The optimization should be triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts should be added. To limit the impact of this change we also propose adding a new parameter (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which could be used to switch off the optimization entirely. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Sort-merge join improvement
Hello everybody We (at University of Zagreb and University of Washington) have implemented an optimization of Spark's sort-merge join (SMJ) which has improved performance of our jobs considerably and we would like to know if Spark community thinks it would be useful to include this in the main distribution. The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted by Y column (within partitions) and you need to calculate an expensive function on the joined rows. During a sort-merge join, Spark will do cross-joins of all rows that have the same X values and calculate the function's value on all of them. If the two tables have a large number of rows per X, this can result in a huge number of calculations. Our optimization allows you to reduce the number of matching rows per X using a range condition on Y columns of the two tables. Something like: ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d The way SMJ is currently implemented, these extra conditions have no influence on the number of rows (per X) being checked because these extra conditions are put in the same block with the function being calculated. Our optimization changes the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window across the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join (we call it "sort-merge inner range join"). Potential use-cases for this are joins based on spatial or temporal distance calculations. The optimization is triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts will be added. We have several questions: 1. Do you see any other way to optimize queries like these (eliminate unnecessary calculations) without changing the sort-merge join algorithm? 2. We believe there is a more general pattern here and that this could help in other similar situations where secondary sorting is available. Would you agree? 3. Would you like us to open a JIRA ticket and create a pull request? Thanks, Petar Zecevic - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
[jira] [Commented] (SPARK-13313) Strongly connected components doesn't find all strongly connected components
[ https://issues.apache.org/jira/browse/SPARK-13313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200028#comment-15200028 ] Petar Zecevic commented on SPARK-13313: --- Ok, thanks for reporting. I'll look into this. > Strongly connected components doesn't find all strongly connected components > > > Key: SPARK-13313 > URL: https://issues.apache.org/jira/browse/SPARK-13313 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.6.0 > Reporter: Petar Zecevic > > Strongly connected components algorithm doesn't find all strongly connected > components. I was using Wikispeedia dataset > (http://snap.stanford.edu/data/wikispeedia.html) and the algorithm found 519 > SCCs and one of them had 4051 vertices, which in reality don't have any edges > between them. > I think the problem could be on line 89 of StronglyConnectedComponents.scala > file where EdgeDirection.In should be changed to EdgeDirection.Out. I believe > the second Pregel call should use Out edge direction, the same as the first > call because the direction is reversed in the provided sendMsg function > (message is sent to source vertex and not destination vertex). > If that is changed (line 89), the algorithm starts finding much more SCCs, > but eventually stack overflow exception occurs. I believe graph objects that > are changed through iterations should not be cached, but checkpointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13313) Strongly connected components doesn't find all strongly connected components
[ https://issues.apache.org/jira/browse/SPARK-13313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147007#comment-15147007 ] Petar Zecevic commented on SPARK-13313: --- No, I don't think it's got anything to do with that. That largest SCC's vertices are not connected in any way and they shouldn't be in the same group. > Strongly connected components doesn't find all strongly connected components > > > Key: SPARK-13313 > URL: https://issues.apache.org/jira/browse/SPARK-13313 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.6.0 > Reporter: Petar Zecevic > > Strongly connected components algorithm doesn't find all strongly connected > components. I was using Wikispeedia dataset > (http://snap.stanford.edu/data/wikispeedia.html) and the algorithm found 519 > SCCs and one of them had 4051 vertices, which in reality don't have any edges > between them. > I think the problem could be on line 89 of StronglyConnectedComponents.scala > file where EdgeDirection.In should be changed to EdgeDirection.Out. I believe > the second Pregel call should use Out edge direction, the same as the first > call because the direction is reversed in the provided sendMsg function > (message is sent to source vertex and not destination vertex). > If that is changed (line 89), the algorithm starts finding much more SCCs, > but eventually stack overflow exception occurs. I believe graph objects that > are changed through iterations should not be cached, but checkpointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13313) Strongly connected components doesn't find all strongly connected components
[ https://issues.apache.org/jira/browse/SPARK-13313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146731#comment-15146731 ] Petar Zecevic commented on SPARK-13313: --- Yes, you need articles.tsv and links.tsv from this archive: http://snap.stanford.edu/data/wikispeedia/wikispeedia_paths-and-graph.tar.gz Then parse the data, assign IDs to article names and create the graph: val articles = sc.textFile("articles.tsv", 6).filter(line => line.trim() != "" && !line.startsWith("#")).zipWithIndex().cache() val links = sc.textFile("links.tsv", 6).filter(line => line.trim() != "" && !line.startsWith("#")) val linkIndexes = links.map(x => { val spl = x.split("\t"); (spl(0), spl(1)) }).join(articles).map(x => x._2).join(articles).map(x => x._2) val wikigraph = Graph.fromEdgeTuples(linkIndexes, 0) Then get strongly connected components: val wikiSCC = wikigraph.stronglyConnectedComponents(100) wikiSCC graph contains 519 SCCs, but there should be much more. The largest SCC in wikiSCC has 4051 vertices and that's obviously wrong. The change in line 89, which I mentioned, seems to solve this problem, but then other issues arise (stack overflow etc) and I don't have time to investigate further. I hope someone will look into this. > Strongly connected components doesn't find all strongly connected components > > > Key: SPARK-13313 > URL: https://issues.apache.org/jira/browse/SPARK-13313 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.6.0 >Reporter: Petar Zecevic > > Strongly connected components algorithm doesn't find all strongly connected > components. I was using Wikispeedia dataset > (http://snap.stanford.edu/data/wikispeedia.html) and the algorithm found 519 > SCCs and one of them had 4051 vertices, which in reality don't have any edges > between them. > I think the problem could be on line 89 of StronglyConnectedComponents.scala > file where EdgeDirection.In should be changed to EdgeDirection.Out. I believe > the second Pregel call should use Out edge direction, the same as the first > call because the direction is reversed in the provided sendMsg function > (message is sent to source vertex and not destination vertex). > If that is changed (line 89), the algorithm starts finding much more SCCs, > but eventually stack overflow exception occurs. I believe graph objects that > are changed through iterations should not be cached, but checkpointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13313) Strongly connected components doesn't find all strongly connected components
Petar Zecevic created SPARK-13313: - Summary: Strongly connected components doesn't find all strongly connected components Key: SPARK-13313 URL: https://issues.apache.org/jira/browse/SPARK-13313 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.6.0 Reporter: Petar Zecevic Strongly connected components algorithm doesn't find all strongly connected components. I was using Wikispeedia dataset (http://snap.stanford.edu/data/wikispeedia.html) and the algorithm found 519 SCCs and one of them had 4051 vertices, which in reality don't have any edges between them. I think the problem could be on line 89 of StronglyConnectedComponents.scala file where EdgeDirection.In should be changed to EdgeDirection.Out. I believe the second Pregel call should use Out edge direction, the same as the first call because the direction is reversed in the provided sendMsg function (message is sent to source vertex and not destination vertex). If that is changed (line 89), the algorithm starts finding much more SCCs, but eventually stack overflow exception occurs. I believe graph objects that are changed through iterations should not be cached, but checkpointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: Is spark suitable for real time query
You can try out a few tricks employed by folks at Lynx Analytics... Daniel Darabos gave some details at Spark Summit: https://www.youtube.com/watch?v=zt1LdVj76LU&index=13&list=PL-x35fyliRwhP52fwDqULJLOnqnrN5nDs On 22.7.2015. 17:00, Louis Hust wrote: My code is like below: Map<String, String> t11opt = new HashMap<String, String>(); t11opt.put("url", DB_URL); t11opt.put("dbtable", "t11"); DataFrame t11 = sqlContext.load("jdbc", t11opt); t11.registerTempTable("t11"); ...the same for t12, t21, t22 DataFrame t1 = t11.unionAll(t12); t1.registerTempTable("t1"); DataFrame t2 = t21.unionAll(t22); t2.registerTempTable("t2"); for (int i = 0; i < 10; i++) { System.out.println(new Date(System.currentTimeMillis())); DataFrame crossjoin = sqlContext.sql("select txt from t1 join t2 on t1.id = t2.id"); crossjoin.show(); System.out.println(new Date(System.currentTimeMillis())); } Where t11, t12, t21, t22 are all table DataFrames loaded over JDBC from a MySQL database that is local to the Spark job. But each loop executes for about 3 seconds and I do not know why it costs so much time. 2015-07-22 19:52 GMT+08:00 Robin East robin.e...@xense.co.uk: Here’s an example using spark-shell on my laptop: sc.textFile("LICENSE").filter(_ contains "Spark").count This takes less than a second the first time I run it and is instantaneous on every subsequent run. What code are you running? On 22 Jul 2015, at 12:34, Louis Hust louis.h...@gmail.com wrote: I did a simple test using Spark in standalone mode (not cluster), and found that a simple action takes a few seconds; the data size is small, just a few rows. So will each Spark job cost some time for init or prepare work no matter what the job is? I mean, will the basic framework of a Spark job cost seconds? 2015-07-22 19:17 GMT+08:00 Robin East robin.e...@xense.co.uk: Real-time is, of course, relative but you’ve mentioned microsecond level. Spark is designed to process large amounts of data in a distributed fashion. No distributed system I know of could give any kind of guarantees at the microsecond level. Robin On 22 Jul 2015, at 11:14, Louis Hust louis.h...@gmail.com wrote: Hi, all I am using a Spark jar in standalone mode, fetching data from different MySQL instances and doing some actions, but I found the time is at the second level. So I want to know if a Spark job is suitable for real-time queries at the microsecond level?
Re: Spark - Eclipse IDE - Maven
Sorry about self-promotion, but there's a really nice tutorial for setting up Eclipse for Spark in Spark in Action book: http://www.manning.com/bonaci/ On 27.7.2015. 10:22, Akhil Das wrote: You can follow this doc https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup Thanks Best Regards On Fri, Jul 24, 2015 at 10:56 AM, Siva Reddy ksiv...@gmail.com mailto:ksiv...@gmail.com wrote: Hi All, I am trying to setup the Eclipse (LUNA) with Maven so that I create a maven projects for developing spark programs. I am having some issues and I am not sure what is the issue. Can Anyone share a nice step-step document to configure eclipse with maven for spark development. Thanks Siva -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Eclipse-IDE-Maven-tp23977.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org mailto:user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org mailto:user-h...@spark.apache.org
Re: Spark - Eclipse IDE - Maven
Sorry about self-promotion, but there's a really nice tutorial for setting up Eclipse for Spark in Spark in Action book: http://www.manning.com/bonaci/ On 24.7.2015. 7:26, Siva Reddy wrote: Hi All, I am trying to setup the Eclipse (LUNA) with Maven so that I create a maven projects for developing spark programs. I am having some issues and I am not sure what is the issue. Can Anyone share a nice step-step document to configure eclipse with maven for spark development. Thanks Siva -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Eclipse-IDE-Maven-tp23977.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Fwd: Model weights of linear regression becomes abnormal values
You probably need to scale the values in the data set so that they are all of comparable ranges, and translate them so that their means become 0. You can use the pyspark.mllib.feature.StandardScaler(True, True) object for that. On 28.5.2015. 6:08, Maheshakya Wijewardena wrote: Hi, I'm trying to use Spark's *LinearRegressionWithSGD* in PySpark with the attached dataset. The code is attached. When I check the model weights vector after training, it contains `nan` values. [nan,nan,nan,nan,nan,nan,nan,nan] But for some data sets, this problem does not occur. What might be the reason for this? Is this an issue with the data I'm using or a bug? Best regards. -- Pruthuvi Maheshakya Wijewardena Software Engineer WSO2 Lanka (Pvt) Ltd Email: mahesha...@wso2.com Mobile: +94711228855 - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
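To make the suggestion concrete, here is a small sketch using the Scala MLlib API, whose StandardScaler mirrors the pyspark.mllib.feature one mentioned above; the toy data is made up for illustration.

  import org.apache.spark.mllib.feature.StandardScaler
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // Toy data with features on very different scales.
  val data = sc.parallelize(Seq(
    LabeledPoint(1.0, Vectors.dense(1000.0, 0.001)),
    LabeledPoint(2.0, Vectors.dense(2000.0, 0.002))))

  // withMean = true, withStd = true: shift each feature to mean 0 and scale to unit variance.
  val scaler = new StandardScaler(withMean = true, withStd = true).fit(data.map(_.features))
  val scaled = data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))

The scaled RDD can then be fed to LinearRegressionWithSGD instead of the raw data.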
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390206#comment-14390206 ] Petar Zecevic commented on SPARK-6646: -- Good one :) Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: How to configure SparkUI to use internal ec2 ip
Did you try setting the SPARK_MASTER_IP parameter in spark-env.sh? On 31.3.2015. 19:19, Anny Chen wrote: Hi Akhil, I tried editing the /etc/hosts on the master and on the workers, and seems it is not working for me. I tried adding hostname internal-ip and it didn't work. I then tried adding internal-ip hostname and it didn't work either. I guess I should also edit the spark-env.sh file? Thanks! Anny On Mon, Mar 30, 2015 at 11:15 PM, Akhil Das ak...@sigmoidanalytics.com mailto:ak...@sigmoidanalytics.com wrote: You can add an internal ip to public hostname mapping in your /etc/hosts file, if your forwarding is proper then it wouldn't be a problem there after. Thanks Best Regards On Tue, Mar 31, 2015 at 9:18 AM, anny9699 anny9...@gmail.com mailto:anny9...@gmail.com wrote: Hi, For security reasons, we added a server between my aws Spark Cluster and local, so I couldn't connect to the cluster directly. To see the SparkUI and its related work's stdout and stderr, I used dynamic forwarding and configured the SOCKS proxy. Now I could see the SparkUI using the internal ec2 ip, however when I click on the application UI (4040) or the worker's UI (8081), it still automatically uses the public DNS instead of internal ec2 ip, which the browser now couldn't show. Is there a way that I could configure this? I saw that one could configure the LOCAL_ADDRESS_IP in the spark-env.sh, but not sure whether this could help. Does anyone experience the same issue? Thanks a lot! Anny -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-configure-SparkUI-to-use-internal-ec2-ip-tp22311.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org mailto:user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org mailto:user-h...@spark.apache.org
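For example (the address below is just a placeholder for the instance's internal EC2 IP), something along these lines in conf/spark-env.sh on the master, followed by a restart of the Spark daemons:

  # conf/spark-env.sh on the master node (placeholder internal address)
  SPARK_MASTER_IP=172.31.0.10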
Re: Spark-submit and multiple files
I tried your program in yarn-client mode and it worked with no exception. This is the command I used: spark-submit --master yarn-client --py-files work.py main.py (Spark 1.2.1) On 20.3.2015. 9:47, Guillaume Charhon wrote: Hi Davies, I am already using --py-files. The system does use the other file. The error I am getting is not trivial. Please check the error log. On Thu, Mar 19, 2015 at 8:03 PM, Davies Liu dav...@databricks.com mailto:dav...@databricks.com wrote: You could submit additional Python source via --py-files , for example: $ bin/spark-submit --py-files work.py main.py On Tue, Mar 17, 2015 at 3:29 AM, poiuytrez guilla...@databerries.com mailto:guilla...@databerries.com wrote: Hello guys, I am having a hard time to understand how spark-submit behave with multiple files. I have created two code snippets. Each code snippet is composed of a main.py and work.py. The code works if I paste work.py then main.py in a pyspark shell. However both snippets do not work when using spark submit and generate different errors. Function add_1 definition outside http://www.codeshare.io/4ao8B https://justpaste.it/jzvj Embedded add_1 function definition http://www.codeshare.io/OQJxq https://justpaste.it/jzvn I am trying a way to make it work. Thank you for your support. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submit-and-multiple-files-tp22097.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org mailto:user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org mailto:user-h...@spark.apache.org
Re: Hamburg Apache Spark Meetup
Please add the Zagreb Meetup group, too. http://www.meetup.com/Apache-Spark-Zagreb-Meetup/ Thanks! On 18.2.2015. 19:46, Johan Beisser wrote: If you could also add the Hamburg Apache Spark Meetup, I'd appreciate it. http://www.meetup.com/Hamburg-Apache-Spark-Meetup/ On Tue, Feb 17, 2015 at 5:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Thanks! I've added you. Matei On Feb 17, 2015, at 4:06 PM, Ralph Bergmann | the4thFloor.eu ra...@the4thfloor.eu wrote: Hi, there is a small Spark Meetup group in Berlin, Germany :-) http://www.meetup.com/Berlin-Apache-Spark-Meetup/ Plaes add this group to the Meetups list at https://spark.apache.org/community.html Ralph - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell
I believe your class needs to be defined as a case class (as I answered on SO). On 25.2.2015. 5:15, anamika gupta wrote: Hi Akhil I guess it skipped my attention. I would definitely give it a try. While I would still like to know what is the issue with the way I have created the schema? On Tue, Feb 24, 2015 at 4:35 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you happen to have a look at https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema Thanks Best Regards On Tue, Feb 24, 2015 at 3:39 PM, anu anamika.guo...@gmail.com wrote: My issue is posted here on stack-overflow. What am I doing wrong here? http://stackoverflow.com/questions/28689186/facing-error-while-extending-scala-class-with-product-interface-to-overcome-limi View this message in context: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell http://apache-spark-user-list.1001560.n3.nabble.com/Facing-error-while-extending-scala-class-with-Product-interface-to-overcome-limit-of-22-fields-in-spl-tp21787.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
Re: Accumulator in SparkUI for streaming
Interesting. Accumulators are shown on the Web UI if you are using the ordinary SparkContext (Spark 1.2). It just has to be named (and that's what you did). scala> val acc = sc.accumulator(0, "test accumulator") acc: org.apache.spark.Accumulator[Int] = 0 scala> val rdd = sc.parallelize(1 to 1000) rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12 scala> rdd.foreach(x => acc += 1) scala> acc.value res1: Int = 1000 The Stage details page then shows the accumulator and its value. On 20.2.2015. 9:25, Tim Smith wrote: On Spark 1.2: I am trying to capture the number of records read from a Kafka topic: val inRecords = ssc.sparkContext.accumulator(0, "InRecords") .. kInStreams.foreach( k => { k.foreachRDD ( rdd => inRecords += rdd.count().toInt ) inRecords.value Question is how do I get the accumulator to show up in the UI? I tried inRecords.value but that didn't help. Pretty sure it isn't showing up in Stage metrics. What's the trick here? collect? Thanks, Tim
Re: Posting to the list
The message went through after all. Sorry for spamming. On 21.2.2015. 21:27, pzecevic wrote: Hi Spark users. Does anybody know what are the steps required to be able to post to this list by sending an email to user@spark.apache.org? I just sent a reply to Corey Nolet's mail Missing shuffle files but I don't think it was accepted by the engine. If I look at the Spark user list, I don't see this topic (Missing shuffle files) at all: http://apache-spark-user-list.1001560.n3.nabble.com/ I can see it in the archives, though: https://mail-archives.apache.org/mod_mbox/spark-user/201502.mbox/browser but my answer is not there. This is not the first time this happened and I am wondering what is going on. The engine is eating my emails? It doesn't like me? I am subscribed to the list and I have the Nabble account. I previously saw one of my email marked with This message has not been accepted by the mailing list yet. I read what that means, but I don't think it applies to me. What am I missing? P.S.: I am posting this through the Nabble web interface. Hope it gets through... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Posting-to-the-list-tp21750.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Missing shuffle files
Could you try to turn on the external shuffle service? spark.shuffle.service.enabled=true On 21.2.2015. 17:50, Corey Nolet wrote: I'm experiencing the same issue. Upon closer inspection I'm noticing that executors are being lost as well. Thing is, I can't figure out how they are dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3TB of memory allocated for the application. I was thinking perhaps it was possible that a single executor was getting a single or a couple large partitions but shouldn't the disk persistence kick in at that point? On Sat, Feb 21, 2015 at 11:20 AM, Anders Arpteg arp...@spotify.com wrote: For large jobs, the following error message is shown that seems to indicate that shuffle files for some reason are missing. It's a rather large job with many partitions. If the data size is reduced, the problem disappears. I'm running a build from Spark master post 1.2 (build at 2015-01-16) and running on Yarn 2.2. Any idea of how to resolve this problem? User class threw exception: Job aborted due to stage failure: Task 450 in stage 450.1 failed 4 times, most recent failure: Lost task 450.3 in stage 450.1 (TID 167370, lon4-hadoopslave-b77.lon4.spotify.net): java.io.FileNotFoundException: /disk/hd06/yarn/local/usercache/arpteg/appcache/application_1424333823218_21217/spark-local-20150221154811-998c/03/rdd_675_450 (No such file or directory) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.<init>(FileOutputStream.java:221) at java.io.FileOutputStream.<init>(FileOutputStream.java:171) at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:76) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:786) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:637) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:149) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:74) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:192) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) TIA, Anders
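A minimal sketch of enabling it from application code (the same property can also go into spark-defaults.conf; note that on YARN the external shuffle service additionally has to be set up as an auxiliary service on the node managers):

  import org.apache.spark.SparkConf
  import org.apache.spark.SparkContext

  // Enable the external shuffle service so shuffle files remain available
  // even if the executor that wrote them is lost.
  val conf = new SparkConf()
    .setAppName("shuffle-service-example")
    .set("spark.shuffle.service.enabled", "true")
  val sc = new SparkContext(conf)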
Re: Where can I find logs set inside RDD processing functions?
You can enable YARN log aggregation (yarn.log-aggregation-enable to true) and execute command yarn logs -applicationId your_application_id after your application finishes. Or you can look at them directly in HDFS in /tmp/logs/user/logs/applicationid/hostname On 6.2.2015. 19:50, nitinkak001 wrote: I am trying to debug my mapPartitionsFunction. Here is the code. There are two ways I am trying to log using log.info() or println(). I am running in yarn-cluster mode. While I can see the logs from driver code, I am not able to see logs from map, mapPartition functions in the Application Tracking URL. Where can I find the logs? /var outputRDD = partitionedRDD.mapPartitions(p = { val outputList = new ArrayList[scala.Tuple3[Long, Long, Int]] p.map({ case(key, value) = { log.info(Inside map) println(Inside map); for(i - 0 until outputTuples.size()){ val outputRecord = outputTuples.get(i) if(outputRecord != null){ outputList.add(outputRecord.getCurrRecordProfileID(), outputRecord.getWindowRecordProfileID, outputRecord.getScore()) } } } }) outputList.iterator() })/ Here is my log4j.properties /log4j.rootCategory=INFO, console log4j.appender.console=org.apache.log4j.ConsoleAppender log4j.appender.console.target=System.err log4j.appender.console.layout=org.apache.log4j.PatternLayout log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n # Settings to quiet third party logs that are too verbose log4j.logger.org.eclipse.jetty=WARN log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO/ -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Where-can-I-find-logs-set-inside-RDD-processing-functions-tp21537.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: LeaseExpiredException while writing schemardd to hdfs
Why don't you just map the RDD's rows to lines and then call saveAsTextFile()? On 3.2.2015. 11:15, Hafiz Mujadid wrote: I want to write a whole SchemaRDD to a single file in HDFS but am facing the following exception: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /test/data/data1.csv (inode 402042): File does not exist. Holder DFSClient_NONMAPREDUCE_-564238432_57 does not have any open files Here is my code: rdd.foreachPartition( iterator => { var output = new Path( outputpath ) val fs = FileSystem.get( new Configuration() ) var writer : BufferedWriter = null writer = new BufferedWriter( new OutputStreamWriter( fs.create( output ) ) ) var line = new StringBuilder iterator.foreach( row => { row.foreach( column => { line.append( column.toString + splitter ) } ) writer.write( line.toString.dropRight( 1 ) ) writer.newLine() line.clear } ) writer.close() } ) I think the problem is that I am creating a writer for each partition and multiple writers are executing in parallel, so when they try to write to the same file this problem appears. When I avoid this approach I face a Task not serializable exception. Any suggestion to handle this problem? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/LeaseExpiredException-while-writing-schemardd-to-hdfs-tp21477.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
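A minimal sketch of the suggested approach, reusing the rdd, splitter and outputpath values from the snippet above (assumed to be a SchemaRDD, a field delimiter string and an HDFS path, respectively):

  // Each Row becomes one delimited text line; Spark writes one part-file per partition,
  // so there is no contention on a single output file.
  rdd.map(row => row.mkString(splitter)).saveAsTextFile(outputpath)

If a single output file is really required, the RDD can be repartitioned with coalesce(1) before the write, at the cost of parallelism, or the part-files can be merged afterwards.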
Re: Discourse: A proposed alternative to the Spark User list
Ok, thanks for the clarifications. I didn't know this list has to remain as the only official list. Nabble is really not the best solution in the world, but we're stuck with it, I guess. That's it from me on this subject. Petar On 22.1.2015. 3:55, Nicholas Chammas wrote: I think a few things need to be laid out clearly: 1. This mailing list is the “official” user discussion platform. That is, it is sponsored and managed by the ASF. 2. Users are free to organize independent discussion platforms focusing on Spark, and there is already one such platform in Stack Overflow under the |apache-spark| and related tags. Stack Overflow works quite well. 3. The ASF will not agree to deprecating or migrating this user list to a platform that they do not control. 4. This mailing list has grown to an unwieldy size and discussions are hard to find or follow; discussion tooling is also lacking. We want to improve the utility and user experience of this mailing list. 5. We don’t want to fragment this “official” discussion community. 6. Nabble is an independent product not affiliated with the ASF. It offers a slightly better interface to the Apache mailing list archives. So to respond to some of your points, pzecevic: Apache user group could be frozen (not accepting new questions, if that’s possible) and redirect users to Stack Overflow (automatic reply?). From what I understand of the ASF’s policies, this is not possible. :( This mailing list must remain the official Spark user discussion platform. Other thing, about new Stack Exchange site I proposed earlier. If a new site is created, there is no problem with guidelines, I think, because Spark community can apply different guidelines for the new site. I think Stack Overflow and the various Spark tags are working fine. I don’t see a compelling need for a Stack Exchange dedicated to Spark, either now or in the near future. Also, I doubt a Spark-specific site can pass the 4 tests in the Area 51 FAQ http://area51.stackexchange.com/faq: * Almost all Spark questions are on-topic for Stack Overflow * Stack Overflow already exists, it already has a tag for Spark, and nobody is complaining * You’re not creating such a big group that you don’t have enough experts to answer all possible questions * There’s a high probability that users of Stack Overflow would enjoy seeing the occasional question about Spark I think complaining won’t be sufficient. :) Someone expressed a concern that they won’t allow creating a project-specific site, but there already exist some project-specific sites, like Tor, Drupal, Ubuntu… The communities for these projects are many, many times larger than the Spark community is or likely ever will be, simply due to the nature of the problems they are solving. What we need is an improvement to this mailing list. We need better tooling than Nabble to sit on top of the Apache archives, and we also need some way to control the volume and quality of mail on the list so that it remains a useful resource for the majority of users. Nick On Wed Jan 21 2015 at 3:13:21 PM pzecevic petar.zece...@gmail.com mailto:petar.zece...@gmail.com wrote: Hi, I tried to find the last reply by Nick Chammas (that I received in the digest) using the Nabble web interface, but I cannot find it (perhaps he didn't reply directly to the user list?). That's one example of Nabble's usability. Anyhow, I wanted to add my two cents... Apache user group could be frozen (not accepting new questions, if that's possible) and redirect users to Stack Overflow (automatic reply?). 
Old questions remain (and are searchable) on Nabble, new questions go to Stack Exchange, so no need for migration. That's the idea, at least, as I'm not sure if that's technically doable... Is it? dev mailing list could perhaps stay on Nabble (it's not that busy), or have a special tag on Stack Exchange. Other thing, about new Stack Exchange site I proposed earlier. If a new site is created, there is no problem with guidelines, I think, because Spark community can apply different guidelines for the new site. There is a FAQ about creating new sites: http://area51.stackexchange.com/faq It says: Stack Exchange sites are free to create and free to use. All we ask is that you have an enthusiastic, committed group of expert users who check in regularly, asking and answering questions. I think this requirement is satisfied... Someone expressed a concern that they won't allow creating a project-specific site, but there already exist some project-specific sites, like Tor, Drupal, Ubuntu... Later, though, the FAQ also says: If Y already exists, it already has a tag for X, and nobody is complaining (then you should not create a new
Re: Discourse: A proposed alternative to the Spark User list
But voting is done on dev list, right? That could stay there... Overlay might be a fine solution, too, but that still gives two user lists (SO and Nabble+overlay). On 22.1.2015. 10:42, Sean Owen wrote: Yes, there is some project business like votes of record on releases that needs to be carried on in standard, simple accessible place and SO is not at all suitable. Nobody is stuck with Nabble. The suggestion is to enable a different overlay on the existing list. SO remains a place you can ask questions too. So I agree with Nick's take. BTW are there perhaps plans to split this mailing list into subproject-specific lists? That might also help tune in/out the subset of conversations of interest. On Jan 22, 2015 10:30 AM, Petar Zecevic petar.zece...@gmail.com mailto:petar.zece...@gmail.com wrote: Ok, thanks for the clarifications. I didn't know this list has to remain as the only official list. Nabble is really not the best solution in the world, but we're stuck with it, I guess. That's it from me on this subject. Petar On 22.1.2015. 3:55, Nicholas Chammas wrote: I think a few things need to be laid out clearly: 1. This mailing list is the “official” user discussion platform. That is, it is sponsored and managed by the ASF. 2. Users are free to organize independent discussion platforms focusing on Spark, and there is already one such platform in Stack Overflow under the |apache-spark| and related tags. Stack Overflow works quite well. 3. The ASF will not agree to deprecating or migrating this user list to a platform that they do not control. 4. This mailing list has grown to an unwieldy size and discussions are hard to find or follow; discussion tooling is also lacking. We want to improve the utility and user experience of this mailing list. 5. We don’t want to fragment this “official” discussion community. 6. Nabble is an independent product not affiliated with the ASF. It offers a slightly better interface to the Apache mailing list archives. So to respond to some of your points, pzecevic: Apache user group could be frozen (not accepting new questions, if that’s possible) and redirect users to Stack Overflow (automatic reply?). From what I understand of the ASF’s policies, this is not possible. :( This mailing list must remain the official Spark user discussion platform. Other thing, about new Stack Exchange site I proposed earlier. If a new site is created, there is no problem with guidelines, I think, because Spark community can apply different guidelines for the new site. I think Stack Overflow and the various Spark tags are working fine. I don’t see a compelling need for a Stack Exchange dedicated to Spark, either now or in the near future. Also, I doubt a Spark-specific site can pass the 4 tests in the Area 51 FAQ http://area51.stackexchange.com/faq: * Almost all Spark questions are on-topic for Stack Overflow * Stack Overflow already exists, it already has a tag for Spark, and nobody is complaining * You’re not creating such a big group that you don’t have enough experts to answer all possible questions * There’s a high probability that users of Stack Overflow would enjoy seeing the occasional question about Spark I think complaining won’t be sufficient. 
:) Someone expressed a concern that they won’t allow creating a project-specific site, but there already exist some project-specific sites, like Tor, Drupal, Ubuntu… The communities for these projects are many, many times larger than the Spark community is or likely ever will be, simply due to the nature of the problems they are solving. What we need is an improvement to this mailing list. We need better tooling than Nabble to sit on top of the Apache archives, and we also need some way to control the volume and quality of mail on the list so that it remains a useful resource for the majority of users. Nick On Wed Jan 21 2015 at 3:13:21 PM pzecevic petar.zece...@gmail.com mailto:petar.zece...@gmail.com wrote: Hi, I tried to find the last reply by Nick Chammas (that I received in the digest) using the Nabble web interface, but I cannot find it (perhaps he didn't reply directly to the user list?). That's one example of Nabble's usability. Anyhow, I wanted to add my two cents... Apache user group could be frozen (not accepting new questions, if that's possible) and redirect users to Stack Overflow (automatic reply?). Old questions remain (and are searchable) on Nabble, new questions go to Stack Exchange, so no need