[jira] [Comment Edited] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-02-17 Thread saravanan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325533#comment-14325533
 ] 

saravanan edited comment on SPARK-5389 at 2/18/15 7:31 AM:
---

I got the same issue on Windows 7. I set the PATH to only the required folders 
and removed the entries that are not needed for Spark, and it worked for me.

Below is the PATH that I set in the DOS prompt:
--
PATH=C:\Program Files\Java\jdk1.7.0_72\bin;C:\Program Files\MKS 
Toolkit\mksnt;C:\PROGRA~1\MKSTOO~1\bin;C:\PROGRA~1\MKSTOO~1\bin\X11;C:\PROGRA~1\MKSTOO~1\mksnt;C:\Windows;C:\Windows\System32;C:\Windows\System32\wbem;C:\Program
 Files (x86)\nodejs\;

Below are the entries that I removed from the existing PATH variable:

C:\Program Files\Java\jdk1.8.0_05/bin;C:\Python33\;
 
C:\IBM\InformationServer\ASBNode\apps\jre\bin\classic;C:\IBM\InformationServer\ASBNode\lib\cpp;C:\IBM\InformationServer\ASBNode\apps\proxy\cpp\vc60\MT_dll\bin;
c:\Program Files\Microsoft SQL Server\100\DTS\Binn\;
c:\Program Files\Microsoft SQL Server\100\Tools\Binn\;
c:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\;
C:\Program Files\Java\jdk1.6.0_45\bin;


Note:
I am using Spark version 1.2.0.

Thanks,
NS Saravanan




was (Author: saravanan303):
I got the same issue on Windows 7. I set the PATH to only the required folders 
and removed the entries that are not needed for Spark, and it worked for me.

Below is the PATH that I set in the DOS prompt:
--
PATH=C:\Program Files\Java\jdk1.7.0_72\bin;C:\Program Files\MKS 
Toolkit\mksnt;C:\PROGRA~1\MKSTOO~1\bin;C:\PROGRA~1\MKSTOO~1\bin\X11;C:\PROGRA~1\MKSTOO~1\mksnt;C:\Windows;C:\Windows\System32;C:\Windows\System32\wbem;C:\Program
 Files (x86)\nodejs\;

Below are the entries that I removed from the existing PATH variable:
C:\Program Files\Java\jdk1.8.0_05/bin;C:\Python33\;
 
C:\IBM\InformationServer\ASBNode\apps\jre\bin\classic;C:\IBM\InformationServer\ASBNode\lib\cpp;C:\IBM\InformationServer\ASBNode\apps\proxy\cpp\vc60\MT_dll\bin;
c:\Program Files\Microsoft SQL Server\100\DTS\Binn\;
c:\Program Files\Microsoft SQL Server\100\Tools\Binn\;
c:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\;
C:\Program Files\Java\jdk1.6.0_45\bin;


Note:
I am using Spark version 1.2.0.

Thanks,
NS Saravanan



> spark-shell.cmd does not run from DOS Windows 7
> ---
>
> Key: SPARK-5389
> URL: https://issues.apache.org/jira/browse/SPARK-5389
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.2.0
>Reporter: Yana Kadiyska
>Priority: Trivial
> Attachments: SparkShell_Win7.JPG
>
>
> spark-shell.cmd crashes in the DOS prompt on Windows 7 but works fine under PowerShell. 
> spark-shell.cmd works fine for me in v1.1, so this is new in Spark 1.2.
> Marking as trivial since calling spark-shell2.cmd also works fine.
> Attaching a screenshot since the error isn't very useful:
> spark-1.2.0-bin-cdh4>bin\spark-shell.cmd
> else was unexpected at this time.






[jira] [Commented] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-02-17 Thread saravanan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325533#comment-14325533
 ] 

saravanan commented on SPARK-5389:
--

I got the same issue on Windows 7. I set the PATH to only the required folders 
and removed the entries that are not needed for Spark, and it worked for me.

Below is the PATH that I set in the DOS prompt:
--
PATH=C:\Program Files\Java\jdk1.7.0_72\bin;C:\Program Files\MKS 
Toolkit\mksnt;C:\PROGRA~1\MKSTOO~1\bin;C:\PROGRA~1\MKSTOO~1\bin\X11;C:\PROGRA~1\MKSTOO~1\mksnt;C:\Windows;C:\Windows\System32;C:\Windows\System32\wbem;C:\Program
 Files (x86)\nodejs\;

Below are the entries that I removed from the existing PATH variable:
C:\Program Files\Java\jdk1.8.0_05/bin;C:\Python33\;
 
C:\IBM\InformationServer\ASBNode\apps\jre\bin\classic;C:\IBM\InformationServer\ASBNode\lib\cpp;C:\IBM\InformationServer\ASBNode\apps\proxy\cpp\vc60\MT_dll\bin;
c:\Program Files\Microsoft SQL Server\100\DTS\Binn\;
c:\Program Files\Microsoft SQL Server\100\Tools\Binn\;
c:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\;
C:\Program Files\Java\jdk1.6.0_45\bin;


Note:
I am using Spark version 1.2.0.

Thanks,
NS Saravanan



> spark-shell.cmd does not run from DOS Windows 7
> ---
>
> Key: SPARK-5389
> URL: https://issues.apache.org/jira/browse/SPARK-5389
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.2.0
>Reporter: Yana Kadiyska
>Priority: Trivial
> Attachments: SparkShell_Win7.JPG
>
>
> spark-shell.cmd crashes in the DOS prompt on Windows 7 but works fine under PowerShell. 
> spark-shell.cmd works fine for me in v1.1, so this is new in Spark 1.2.
> Marking as trivial since calling spark-shell2.cmd also works fine.
> Attaching a screenshot since the error isn't very useful:
> spark-1.2.0-bin-cdh4>bin\spark-shell.cmd
> else was unexpected at this time.






[jira] [Resolved] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset

2015-02-17 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-5731.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

> Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic 
> stream receiving with multiple topics and smallest starting offset
> 
>
> Key: SPARK-5731
> URL: https://issues.apache.org/jira/browse/SPARK-5731
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Tests
>Affects Versions: 1.3.0
>Reporter: Patrick Wendell
>Assignee: Tathagata Das
>Priority: Blocker
>  Labels: flaky-test
> Fix For: 1.3.0
>
>
> {code}
> sbt.ForkMain$ForkError: The code passed to eventually never returned 
> normally. Attempted 110 times over 20.070287525 seconds. Last failure 
> message: 300 did not equal 48 didn't get all messages.
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
>   at 
> org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
>   at 
> org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49)
>   at 
> org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply$mcV$sp(DirectKafkaStreamSuite.scala:110)
>   at 
> org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70)
>   at 
> org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:38)
>   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
>   at 
> org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:38)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$run(DirectKafkaStreamSuite.scala:38)
>   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
>   at 
> org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfterAll$$super$run(DirectKafkaStreamSuite.scala:38)
>   at 
> org.scal

[jira] [Commented] (SPARK-4912) Persistent data source tables

2015-02-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325433#comment-14325433
 ] 

Yin Huai commented on SPARK-4912:
-

The backport for the second issue is at 
https://github.com/apache/spark/pull/4671.

> Persistent data source tables
> -
>
> Key: SPARK-4912
> URL: https://issues.apache.org/jira/browse/SPARK-4912
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.3.0
>
>
> It would be good if tables created through the new data sources api could be 
> persisted to the hive metastore.






[jira] [Commented] (SPARK-4903) RDD remains cached after "DROP TABLE"

2015-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325434#comment-14325434
 ] 

Apache Spark commented on SPARK-4903:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/4671

> RDD remains cached after "DROP TABLE"
> -
>
> Key: SPARK-4903
> URL: https://issues.apache.org/jira/browse/SPARK-4903
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Spark master @ Dec 17 
> (3cd516191baadf8496ccdae499771020e89acd7e)
>Reporter: Evert Lammerts
>Priority: Critical
>
> In beeline, when I run:
> {code:sql}
> CREATE TABLE test AS select col from table;
> CACHE TABLE test
> DROP TABLE test
> {code}
> The table is removed but the RDD is still cached. Running UNCACHE is no longer 
> possible (the table is not found in the metastore).
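
Until the fix lands, a minimal PySpark sketch of an ordering that avoids stranding the cached RDD, assuming the standard SQLContext/HiveContext API of that era; the table and column names are placeholders:
{code}
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="uncache-before-drop")
sqlContext = HiveContext(sc)

sqlContext.sql("CREATE TABLE test AS SELECT col FROM some_table")
sqlContext.sql("CACHE TABLE test")

# ... queries against the cached table ...

# Release the in-memory data while the metastore entry still exists,
# then drop the table; nothing is left behind in the cache.
sqlContext.uncacheTable("test")
sqlContext.sql("DROP TABLE test")
{code}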






[jira] [Commented] (SPARK-4912) Persistent data source tables

2015-02-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325426#comment-14325426
 ] 

Yin Huai commented on SPARK-4912:
-

[~kayousterhout] It seems master still has the first issue; I have created 
SPARK-5881 for it. For the second issue, I will prepare a backport.

> Persistent data source tables
> -
>
> Key: SPARK-4912
> URL: https://issues.apache.org/jira/browse/SPARK-4912
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.3.0
>
>
> It would be good if tables created through the new data sources api could be 
> persisted to the hive metastore.






[jira] [Created] (SPARK-5881) RDD remains cached after the table gets overridden by "CACHE TABLE"

2015-02-17 Thread Yin Huai (JIRA)
Yin Huai created SPARK-5881:
---

 Summary: RDD remains cached after the table gets overridden by 
"CACHE TABLE"
 Key: SPARK-5881
 URL: https://issues.apache.org/jira/browse/SPARK-5881
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Blocker


{code}
val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
sqlContext.jsonRDD(rdd).registerTempTable("jt")

sqlContext.sql("CACHE TABLE foo AS SELECT * FROM jt")
sqlContext.sql("CACHE TABLE foo AS SELECT * FROM jt")
{code}
After the second CACHE TABLE command, the RDD created by the first command still 
remains in the cache.
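
A possible workaround until this is fixed, sketched in PySpark (the repro above is Scala) and assuming the 1.2/1.3-era API: uncache the old table explicitly before re-running CACHE TABLE ... AS, so the first in-memory relation is released instead of being left behind.
{code}
import json
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="recache-example")
sqlContext = SQLContext(sc)

rdd = sc.parallelize([json.dumps({"a": i, "b": "str%d" % i}) for i in range(1, 11)])
sqlContext.jsonRDD(rdd).registerTempTable("jt")

sqlContext.sql("CACHE TABLE foo AS SELECT * FROM jt")
sqlContext.uncacheTable("foo")                         # drop the first cached RDD
sqlContext.sql("CACHE TABLE foo AS SELECT * FROM jt")  # re-cache without leaking
{code}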






[jira] [Commented] (SPARK-5880) Change log level of batch pruning string in InMemoryColumnarTableScan from Info to Debug

2015-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325406#comment-14325406
 ] 

Apache Spark commented on SPARK-5880:
-

User 'nitin2goyal' has created a pull request for this issue:
https://github.com/apache/spark/pull/4669

> Change log level of batch pruning string in InMemoryColumnarTableScan from 
> Info to Debug
> 
>
> Key: SPARK-5880
> URL: https://issues.apache.org/jira/browse/SPARK-5880
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Nitin Goyal
>Priority: Trivial
> Fix For: 1.3.0
>
>
> In InMemoryColumnarTableScan, we build a string of the statistics of all the 
> columns and log it at INFO level whenever batch pruning happens. This causes a 
> performance hit when there are a large number of batches and a good number of 
> columns and almost every batch gets pruned.
> We can make the string evaluate lazily and change the log level to DEBUG.






[jira] [Comment Edited] (SPARK-4912) Persistent data source tables

2015-02-17 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325388#comment-14325388
 ] 

Kay Ousterhout edited comment on SPARK-4912 at 2/18/15 4:29 AM:


Is it possible to backport this to 1.2?  It fixes 2 annoying issues:

(1) If you do:

cache table foo as ;
cache table foo;

The second cache table creates a 2nd, new RDD, meaning the first cached RDD is 
stuck in memory and can't be deleted ("uncache foo" just deletes the 2nd RDD, 
but the 1st one is still there).

(2) SPARK-4903


was (Author: kayousterhout):
Is it possible to backport this to 1.2?  It fixes 2 annoying issues:

(1) If you do:

cache table foo as ;
cache table foo;

The second cache table creates a 2nd, new RDD, meaning the first cached RDD is 
stuck in memory and can't be deleted ("uncache foo" just deletes the 2nd RDD, 
but the 1st one is still there).

(2)

cache table foo as ...;
drop table foo;

Leaves foo still in memory (and, similar to the above, now undeletable).

> Persistent data source tables
> -
>
> Key: SPARK-4912
> URL: https://issues.apache.org/jira/browse/SPARK-4912
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.3.0
>
>
> It would be good if tables created through the new data sources api could be 
> persisted to the hive metastore.






[jira] [Comment Edited] (SPARK-4912) Persistent data source tables

2015-02-17 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325388#comment-14325388
 ] 

Kay Ousterhout edited comment on SPARK-4912 at 2/18/15 4:23 AM:


Is it possible to backport this to 1.2?  It fixes 2 annoying issues:

(1) If you do:

cache table foo as ;
cache table foo;

The second cache table creates a 2nd, new RDD, meaning the first cached RDD is 
stuck in memory and can't be deleted ("uncache foo" just deletes the 2nd RDD, 
but the 1st one is still there).

(2)

cache table foo as ...;
drop table foo;

Leaves foo still in memory (and, similar to the above, now undeletable).


was (Author: kayousterhout):
Is it possible to backport this to 1.2?  It fixes an annoying issue where if 
you do:

cache table foo as ;
cache table foo;

The second cache table creates a 2nd, new RDD, meaning the first cached RDD is 
stuck in memory and can't be deleted ("uncache foo" just deletes the 2nd RDD, 
but the 1st one is still there).

> Persistent data source tables
> -
>
> Key: SPARK-4912
> URL: https://issues.apache.org/jira/browse/SPARK-4912
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.3.0
>
>
> It would be good if tables created through the new data sources api could be 
> persisted to the hive metastore.






[jira] [Commented] (SPARK-4912) Persistent data source tables

2015-02-17 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325388#comment-14325388
 ] 

Kay Ousterhout commented on SPARK-4912:
---

Is it possible to backport this to 1.2?  It fixes an annoying issue where if 
you do:

cache table foo as ;
cache table foo;

The second cache table creates a 2nd, new RDD, meaning the first cached RDD is 
stuck in memory and can't be deleted ("uncache foo" just deletes the 2nd RDD, 
but the 1st one is still there).

> Persistent data source tables
> -
>
> Key: SPARK-4912
> URL: https://issues.apache.org/jira/browse/SPARK-4912
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.3.0
>
>
> It would be good if tables created through the new data sources api could be 
> persisted to the hive metastore.






[jira] [Commented] (SPARK-5629) Add spark-ec2 action to return info about an existing cluster

2015-02-17 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325372#comment-14325372
 ] 

Nicholas Chammas commented on SPARK-5629:
-

YAML is not part of the Python standard library, unfortunately.

Agree on somehow marking this as experimental.

I think YAML is like JSON in that adding or removing fields shouldn't break any 
parsers; there is no pre-defined schema. It should only affect you if you try to 
access a field that was removed, for example, just as it would with JSON.
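
For reference, a sketch of what the emitting side could look like with the third-party PyYAML package (pip install pyyaml); the cluster data is made up and this is not spark-ec2's actual code:
{code}
import yaml

cluster_info = {
    "my-spark-cluster": {
        "launched": "2015-02-18 14:03:22 UTC",
        "status": "running",
        "nodes": {
            "master": "ec2-1-2-3-4.us-west-2.compute.amazonaws.com",
            "slaves": ["ec2-5-6-7-8.us-west-2.compute.amazonaws.com"],
        },
    }
}

# default_flow_style=False produces the block-style, human-readable layout shown
# elsewhere in this ticket. A consumer that only reads the keys it knows about
# is unaffected if a later release adds a new field.
print(yaml.safe_dump(cluster_info, default_flow_style=False))
{code}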

> Add spark-ec2 action to return info about an existing cluster
> -
>
> Key: SPARK-5629
> URL: https://issues.apache.org/jira/browse/SPARK-5629
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> You can launch multiple clusters using spark-ec2. At some point, you might 
> just want to get some information about an existing cluster.
> Use cases include:
> * Wanting to check something about your cluster in the EC2 web console.
> * Wanting to feed information about your cluster to another tool (e.g. as 
> described in [SPARK-5627]).
> So, in addition to the [existing 
> actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]:
> * {{launch}}
> * {{destroy}}
> * {{login}}
> * {{stop}}
> * {{start}}
> * {{get-master}}
> * {{reboot-slaves}}
> We add a new action, {{describe}}, which describes an existing cluster if 
> given a cluster name, and all clusters if not.
> Some examples:
> {code}
> # describes all clusters launched by spark-ec2
> spark-ec2 describe
> {code}
> {code}
> # describes cluster-1
> spark-ec2 describe cluster-1
> {code}
> In combination with the proposal in [SPARK-5627]:
> {code}
> # describes cluster-3 in a machine-readable way (e.g. JSON)
> spark-ec2 describe cluster-3 --machine-readable
> {code}
> Parallels in similar tools include:
> * [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju
> * [{{starcluster 
> listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node]
>  from MIT StarCluster






[jira] [Created] (SPARK-5880) Change log level of batch pruning string in InMemoryColumnarTableScan from Info to Debug

2015-02-17 Thread Nitin Goyal (JIRA)
Nitin Goyal created SPARK-5880:
--

 Summary: Change log level of batch pruning string in 
InMemoryColumnarTableScan from Info to Debug
 Key: SPARK-5880
 URL: https://issues.apache.org/jira/browse/SPARK-5880
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.1, 1.3.0
Reporter: Nitin Goyal
Priority: Trivial
 Fix For: 1.3.0


In InMemoryColumnarTableScan, we build a string of the statistics of all the 
columns and log it at INFO level whenever batch pruning happens. This causes a 
performance hit when there are a large number of batches and a good number of 
columns and almost every batch gets pruned.

We can make the string evaluate lazily and change the log level to DEBUG.
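
The actual change is in the Scala code of InMemoryColumnarTableScan, but the idea is easy to illustrate: only build the expensive statistics string when DEBUG output will actually be emitted. A minimal Python sketch of the pattern (the statistics and logger name here are made up):
{code}
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("InMemoryColumnarTableScan")

def expensive_stats_string(column_stats):
    # Stand-in for walking the statistics of every column in every batch.
    return ", ".join("%s: %s" % (name, stats) for name, stats in column_stats)

def on_batch_pruned(column_stats):
    # Guarding on the level means the string is never built at INFO and above.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("Pruned batch, stats: %s", expensive_stats_string(column_stats))

on_batch_pruned([("key", "min=1, max=10"), ("value", "min=a, max=z")])
{code}
In Scala the same effect is usually achieved with a by-name (lazy) message argument, which is what "make the string evaluate lazily" refers to.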






[jira] [Commented] (SPARK-5629) Add spark-ec2 action to return info about an existing cluster

2015-02-17 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325354#comment-14325354
 ] 

Shivaram Venkataraman commented on SPARK-5629:
--

This sounds fine to me and I really like YAML -- does Python have native 
support for printing out YAML?
One thing we should probably do is mark this as experimental, as we might not be 
able to maintain backwards compatibility, etc. (On that note, are YAML parsers 
backwards compatible? i.e. if we add a new field in the next release, will it 
break things?)

> Add spark-ec2 action to return info about an existing cluster
> -
>
> Key: SPARK-5629
> URL: https://issues.apache.org/jira/browse/SPARK-5629
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> You can launch multiple clusters using spark-ec2. At some point, you might 
> just want to get some information about an existing cluster.
> Use cases include:
> * Wanting to check something about your cluster in the EC2 web console.
> * Wanting to feed information about your cluster to another tool (e.g. as 
> described in [SPARK-5627]).
> So, in addition to the [existing 
> actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]:
> * {{launch}}
> * {{destroy}}
> * {{login}}
> * {{stop}}
> * {{start}}
> * {{get-master}}
> * {{reboot-slaves}}
> We add a new action, {{describe}}, which describes an existing cluster if 
> given a cluster name, and all clusters if not.
> Some examples:
> {code}
> # describes all clusters launched by spark-ec2
> spark-ec2 describe
> {code}
> {code}
> # describes cluster-1
> spark-ec2 describe cluster-1
> {code}
> In combination with the proposal in [SPARK-5627]:
> {code}
> # describes cluster-3 in a machine-readable way (e.g. JSON)
> spark-ec2 describe cluster-3 --machine-readable
> {code}
> Parallels in similar tools include:
> * [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju
> * [{{starcluster 
> listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node]
>  from MIT StarCluster






[jira] [Commented] (SPARK-5629) Add spark-ec2 action to return info about an existing cluster

2015-02-17 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325346#comment-14325346
 ] 

Nicholas Chammas commented on SPARK-5629:
-

For example, you run:

{code}
$ spark-ec2 describe my-spark-cluster
{code}

And you get back something like this:

{code}
my-spark-cluster:
  launched: "2015-02-18 14:03:22 UTC"
  status: running
  nodes:
master: ec2-54-69-105-224.us-west-2.compute.amazonaws.com
slaves:
  - ec2-54-69-1215-97.us-west-2.compute.amazonaws.com
  - ec2-54-69-186-101.us-west-2.compute.amazonaws.com
  - ec2-54-69-186-109.us-west-2.compute.amazonaws.com
{code}

Actually, since this is both valid YAML and very human-readable, we probably 
don't need the {{--machine-readable}} option mentioned in the ticket body.
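
Since the output is plain YAML, another tool can consume it directly; a small sketch with the third-party PyYAML package, inlining the sample output above for illustration:
{code}
import yaml

sample_output = """
my-spark-cluster:
  launched: "2015-02-18 14:03:22 UTC"
  status: running
  nodes:
    master: ec2-54-69-105-224.us-west-2.compute.amazonaws.com
    slaves:
      - ec2-54-69-186-101.us-west-2.compute.amazonaws.com
"""

# In practice this would be the captured stdout of "spark-ec2 describe".
clusters = yaml.safe_load(sample_output)
print("master host:", clusters["my-spark-cluster"]["nodes"]["master"])
{code}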

> Add spark-ec2 action to return info about an existing cluster
> -
>
> Key: SPARK-5629
> URL: https://issues.apache.org/jira/browse/SPARK-5629
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> You can launch multiple clusters using spark-ec2. At some point, you might 
> just want to get some information about an existing cluster.
> Use cases include:
> * Wanting to check something about your cluster in the EC2 web console.
> * Wanting to feed information about your cluster to another tool (e.g. as 
> described in [SPARK-5627]).
> So, in addition to the [existing 
> actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]:
> * {{launch}}
> * {{destroy}}
> * {{login}}
> * {{stop}}
> * {{start}}
> * {{get-master}}
> * {{reboot-slaves}}
> We add a new action, {{describe}}, which describes an existing cluster if 
> given a cluster name, and all clusters if not.
> Some examples:
> {code}
> # describes all clusters launched by spark-ec2
> spark-ec2 describe
> {code}
> {code}
> # describes cluster-1
> spark-ec2 describe cluster-1
> {code}
> In combination with the proposal in [SPARK-5627]:
> {code}
> # describes cluster-3 in a machine-readable way (e.g. JSON)
> spark-ec2 describe cluster-3 --machine-readable
> {code}
> Parallels in similar tools include:
> * [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju
> * [{{starcluster 
> listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node]
>  from MIT StarCluster






[jira] [Commented] (SPARK-925) Allow ec2 scripts to load default options from a json file

2015-02-17 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325334#comment-14325334
 ] 

Nicholas Chammas commented on SPARK-925:


Here's an example of what a spark-ec2 {{config.yml}} file could look like:

{code}
region: us-east-1

aws_auth:
  key_pair: mykey
  identity_file: /path/to/file.pem

# spark_version: 1.2.1

slaves: 5
instance_type: m3.large

use_existing_master: no
{code}

It's dead simple and there's not much to learn, really.
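
A rough sketch of how such a file could be consumed, assuming the third-party PyYAML package and a hypothetical opts object coming from the existing option parser (none of this is actual spark-ec2 code):
{code}
import yaml

def apply_config_defaults(opts, path="config.yml"):
    """Fill in any option the user did not pass on the command line."""
    with open(path) as f:
        config = yaml.safe_load(f) or {}
    for key, value in config.items():
        # Command-line options take precedence: only fill unset values.
        if getattr(opts, key, None) in (None, ""):
            setattr(opts, key, value)
    return opts
{code}
Nested sections such as aws_auth would need a little extra handling, but the principle is the same.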

> Allow ec2 scripts to load default options from a json file
> --
>
> Key: SPARK-925
> URL: https://issues.apache.org/jira/browse/SPARK-925
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Affects Versions: 0.8.0
>Reporter: Shay Seng
>Priority: Minor
>
> The option list for the ec2 script can be a little irritating to type in, 
> especially things like the path to the identity-file, region, zone, ami, etc.
> It would be nice if the ec2 script looked for an options.json file in the 
> following order: (1) PWD, (2) ~/spark-ec2, (3) the same dir as spark_ec2.py
> Something like:
> def get_defaults_from_options():
>   # Check to see if a options.json file exists, if so load it. 
>   # However, values in the options.json file can only override values in opts
>   # if the Opt values are None or ""
>   # i.e. command-line options take precedence 
>   defaults = 
> {'aws-access-key-id':'','aws-secret-access-key':'','key-pair':'', 
> 'identity-file':'', 'region':'ap-southeast-1', 'zone':'', 
> 'ami':'','slaves':1, 'instance-type':'m1.large'}
>   # Look for options.json in directory cluster was called from
>   # Had to modify the spark_ec2 wrapper script since it mangles the pwd
>   startwd = os.environ['STARTWD']
>   if os.path.exists(os.path.join(startwd,"options.json")):
>   optionspath = os.path.join(startwd,"options.json")
>   else:
>   optionspath = os.path.join(os.getcwd(),"options.json")
>   
>   try:
> print "Loading options file: ", optionspath  
> with open (optionspath) as json_data:
> jdata = json.load(json_data)
> for k in jdata:
>   defaults[k]=jdata[k]
>   except IOError:
> print 'Warning: options.json file not loaded'
>   # Check permissions on identity-file, if defined, otherwise launch will 
> fail late and will be irritating
>   if defaults['identity-file']!='':
> st = os.stat(defaults['identity-file'])
> user_can_read = bool(st.st_mode & stat.S_IRUSR)
> grp_perms = bool(st.st_mode & stat.S_IRWXG)
> others_perm = bool(st.st_mode & stat.S_IRWXO)
> if (not user_can_read):
>   print "No read permission to read ", defaults['identity-file']
>   sys.exit(1)
> if (grp_perms or others_perm):
>   print "Permissions are too open, please chmod 600 file ", 
> defaults['identity-file']
>   sys.exit(1)
>   # if defaults contain AWS access id or private key, set it to environment. 
>   # required for use with boto to access the AWS console 
>   if defaults['aws-access-key-id'] != '':
> os.environ['AWS_ACCESS_KEY_ID']=defaults['aws-access-key-id'] 
>   if defaults['aws-secret-access-key'] != '':   
> os.environ['AWS_SECRET_ACCESS_KEY'] = defaults['aws-secret-access-key']
>   return defaults  
> 






[jira] [Resolved] (SPARK-5723) Change the default file format to Parquet for CTAS statements.

2015-02-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5723.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4639
[https://github.com/apache/spark/pull/4639]

> Change the default file format to Parquet for CTAS statements.
> --
>
> Key: SPARK-5723
> URL: https://issues.apache.org/jira/browse/SPARK-5723
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.3.0
>
>
> Right now, if you issue a CTAS query without specifying the file format and 
> serde info, we will use TextFile. We should switch to Parquet.
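
For context, a hedged illustration of the difference from the user's side, using placeholder table names and assuming a HiveContext: before this change a bare CTAS produced a TextFile table, so users who wanted Parquet had to ask for it explicitly.
{code}
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="ctas-format")
sqlContext = HiveContext(sc)

# Old default: a bare CTAS falls back to TextFile
sqlContext.sql("CREATE TABLE t_text AS SELECT key, value FROM src")

# Explicitly requesting Parquet, which this change makes the CTAS default
sqlContext.sql("CREATE TABLE t_parquet STORED AS PARQUET AS SELECT key, value FROM src")
{code}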






[jira] [Commented] (SPARK-925) Allow ec2 scripts to load default options from a json file

2015-02-17 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325317#comment-14325317
 ] 

Nicholas Chammas commented on SPARK-925:


I would prefer a format that is more human-friendly and that supports comments 
directly. To me, JSON is better for data exchange, and YAML is better for 
config files and other things that humans are going to be dealing with directly.

It's true that there are other config formats used in Spark. The ones under 
[conf/|https://github.com/apache/spark/tree/master/conf], however, are not 
JSON. Which ones were you thinking of?

As long as the config format is consistent within a sub-project, I think it's 
OK. Since spark-ec2 doesn't have any config files yet, I don't think it's bad 
to go with YAML.

{quote}
With JSON we deal with internally, we have started to nest definitions so that 
it is easy for some one to modify one small setting without having to specify 
all the other settings – and as a work around to comments.
{quote}

As discussed before, YAML supports comments directly, which IMO is essential 
for a config format. With regards to modifying a setting without specifying 
everything, I'm not sure I understand the use case.

If we define some config file resolution order (first check /first/config, then 
check /second/config, etc.), is it that bad if people just copied the default 
config from /second/config to /first/config and modified what they wanted? I 
believe that's how it generally works in tools that check multiple places for 
configuration.

A better way to do this would probably be to allow people to specify a subset 
of options in any given file, with option sets merged on top of the options 
specified in the preceding file. That seems like more complexity than is worth 
it at this time, though.
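
A small sketch of the resolution-order-plus-merge idea using the stdlib json module (the ticket is about options.json); the search paths mirror the order proposed in the ticket body, and this is illustrative rather than a proposed implementation:
{code}
import json
import os

SEARCH_PATHS = [
    os.path.join(os.getcwd(), "options.json"),
    os.path.expanduser("~/spark-ec2/options.json"),
    os.path.join(os.path.dirname(os.path.abspath(__file__)), "options.json"),
]

def load_layered_options():
    """Merge option files so that files earlier in the search order win."""
    merged = {}
    # Walk from lowest to highest priority so higher-priority files overwrite.
    for path in reversed(SEARCH_PATHS):
        if os.path.exists(path):
            with open(path) as f:
                merged.update(json.load(f))
    return merged
{code}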

> Allow ec2 scripts to load default options from a json file
> --
>
> Key: SPARK-925
> URL: https://issues.apache.org/jira/browse/SPARK-925
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Affects Versions: 0.8.0
>Reporter: Shay Seng
>Priority: Minor
>
> The option list for the ec2 script can be a little irritating to type in, 
> especially things like the path to the identity-file, region, zone, ami, etc.
> It would be nice if the ec2 script looked for an options.json file in the 
> following order: (1) PWD, (2) ~/spark-ec2, (3) the same dir as spark_ec2.py
> Something like:
> def get_defaults_from_options():
>   # Check to see if a options.json file exists, if so load it. 
>   # However, values in the options.json file can only override values in opts
>   # if the Opt values are None or ""
>   # i.e. command-line options take precedence 
>   defaults = 
> {'aws-access-key-id':'','aws-secret-access-key':'','key-pair':'', 
> 'identity-file':'', 'region':'ap-southeast-1', 'zone':'', 
> 'ami':'','slaves':1, 'instance-type':'m1.large'}
>   # Look for options.json in directory cluster was called from
>   # Had to modify the spark_ec2 wrapper script since it mangles the pwd
>   startwd = os.environ['STARTWD']
>   if os.path.exists(os.path.join(startwd,"options.json")):
>   optionspath = os.path.join(startwd,"options.json")
>   else:
>   optionspath = os.path.join(os.getcwd(),"options.json")
>   
>   try:
> print "Loading options file: ", optionspath  
> with open (optionspath) as json_data:
> jdata = json.load(json_data)
> for k in jdata:
>   defaults[k]=jdata[k]
>   except IOError:
> print 'Warning: options.json file not loaded'
>   # Check permissions on identity-file, if defined, otherwise launch will 
> fail late and will be irritating
>   if defaults['identity-file']!='':
> st = os.stat(defaults['identity-file'])
> user_can_read = bool(st.st_mode & stat.S_IRUSR)
> grp_perms = bool(st.st_mode & stat.S_IRWXG)
> others_perm = bool(st.st_mode & stat.S_IRWXO)
> if (not user_can_read):
>   print "No read permission to read ", defaults['identity-file']
>   sys.exit(1)
> if (grp_perms or others_perm):
>   print "Permissions are too open, please chmod 600 file ", 
> defaults['identity-file']
>   sys.exit(1)
>   # if defaults contain AWS access id or private key, set it to environment. 
>   # required for use with boto to access the AWS console 
>   if defaults['aws-access-key-id'] != '':
> os.environ['AWS_ACCESS_KEY_ID']=defaults['aws-access-key-id'] 
>   if defaults['aws-secret-access-key'] != '':   
> os.environ['AWS_SECRET_ACCESS_KEY'] = defaults['aws-secret-access-key']
>   return defaults  
> 




[jira] [Resolved] (SPARK-5875) logical.Project should not be resolved if it contains aggregates or generators

2015-02-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5875.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4663
[https://github.com/apache/spark/pull/4663]

> logical.Project should not be resolved if it contains aggregates or generators
> --
>
> Key: SPARK-5875
> URL: https://issues.apache.org/jira/browse/SPARK-5875
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
> Fix For: 1.3.0
>
>
> To reproduce...
> {code}
> val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
> sqlContext.jsonRDD(rdd).registerTempTable("jt")
> sqlContext.sql("CREATE TABLE gen_tmp (key Int)")
> sqlContext.sql("INSERT OVERWRITE TABLE gen_tmp SELECT explode(array(1,2,3)) 
> AS val FROM jt LIMIT 1")
> {code}
> The exception is
> {code}
> org.apache.spark.sql.AnalysisException: invalid cast from 
> array> to int;
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.failAnalysis(Analyzer.scala:85)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$18$$anonfun$apply$2.applyOrElse(Analyzer.scala:98)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$18$$anonfun$apply$2.applyOrElse(Analyzer.scala:92)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)
> {code}
> The cause of this exception is that PreInsertionCasts in HiveMetastoreCatalog 
> was triggered on an invalid query plan 
> {code}
> Project 
> [HiveGenericUdtf#org.apache.hadoop.hive.ql.udf.generic.GenericUDTFExplode(Array(1,2,3))
>  AS val#19]
>   Subquery jt
>LogicalRDD [a#0L,b#1], MapPartitionsRDD[4] at map at JsonRDD.scala:41
> {code}
> Then, after the transformation of PreInsertionCasts, ImplicitGenerate cannot 
> be applied.






[jira] [Resolved] (SPARK-5810) Maven Coordinate Inclusion failing in pySpark

2015-02-17 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-5810.

Resolution: Fixed

Fixed with SPARK-5811 & SPARK-2313

> Maven Coordinate Inclusion failing in pySpark
> -
>
> Key: SPARK-5810
> URL: https://issues.apache.org/jira/browse/SPARK-5810
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, PySpark
>Affects Versions: 1.3.0
>Reporter: Burak Yavuz
>Assignee: Josh Rosen
>Priority: Blocker
> Fix For: 1.3.0
>
>
> When including Maven coordinates to download dependencies in PySpark, PySpark 
> returns a GatewayError because it cannot read the proper port to communicate 
> with the JVM. This is because PySpark relies on STDIN to read the port number, 
> and in the meantime Ivy prints out a whole lot of logs.
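
A toy sketch of why that breaks, not PySpark's actual launcher code: the parent expects the first line it reads to be the port number, so any logging that reaches the same stream first makes the parse fail.
{code}
import subprocess
import sys

# The child stands in for the JVM side: it logs (as Ivy would) before printing
# the port the parent is waiting for.
child_code = (
    "print('downloading org.foo#bar;1.0 ...')\n"
    "print(54321)\n"
)
proc = subprocess.Popen([sys.executable, "-c", child_code],
                        stdout=subprocess.PIPE, universal_newlines=True)
first_line = proc.stdout.readline().strip()
try:
    port = int(first_line)   # only succeeds if nothing was printed before the port
    print("got port", port)
except ValueError:
    print("handshake broken, first line was: %r" % first_line)
proc.wait()
{code}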






[jira] [Updated] (SPARK-4454) Race condition in DAGScheduler

2015-02-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4454:
---
Labels: backport-needed  (was: )

> Race condition in DAGScheduler
> --
>
> Key: SPARK-4454
> URL: https://issues.apache.org/jira/browse/SPARK-4454
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.1.0
>Reporter: Rafal Kwasny
>Assignee: Josh Rosen
>Priority: Critical
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> It seems to be a race condition in DAGScheduler that manifests on jobs with 
> high concurrency:
> {noformat}
>  Exception in thread "main" java.util.NoSuchElementException: key not found: 
> 35
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:58)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
> at 
> org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
> at 
> org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1275)
> at 
> org.apache.spark.SparkContext.getPreferredLocs(

[jira] [Updated] (SPARK-4454) Race condition in DAGScheduler

2015-02-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4454:
---
Target Version/s: 1.3.0, 1.2.2  (was: 1.3.0)

> Race condition in DAGScheduler
> --
>
> Key: SPARK-4454
> URL: https://issues.apache.org/jira/browse/SPARK-4454
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.1.0
>Reporter: Rafal Kwasny
>Assignee: Josh Rosen
>Priority: Critical
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> It seems to be a race condition in DAGScheduler that manifests on jobs with 
> high concurrency:
> {noformat}
>  Exception in thread "main" java.util.NoSuchElementException: key not found: 
> 35
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:58)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
> at 
> org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
> at 
> org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1275)
> at 
> org.apache.spark.SparkContext.getPr

[jira] [Reopened] (SPARK-4454) Race condition in DAGScheduler

2015-02-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reopened SPARK-4454:


Actually, re-opening this since we need to backport it.

> Race condition in DAGScheduler
> --
>
> Key: SPARK-4454
> URL: https://issues.apache.org/jira/browse/SPARK-4454
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.1.0
>Reporter: Rafal Kwasny
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> It seems to be a race condition in DAGScheduler that manifests on jobs with 
> high concurrency:
> {noformat}
>  Exception in thread "main" java.util.NoSuchElementException: key not found: 
> 35
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:58)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
> at 
> org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
> at 
> org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1275)
> at 
> org.apache.spark.SparkContext.getPreferredLocs(SparkContext.sca

[jira] [Resolved] (SPARK-4454) Race condition in DAGScheduler

2015-02-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4454.

   Resolution: Fixed
Fix Version/s: 1.3.0

We can't be 100% sure this is fixed because it was not a reproducible issue. 
However, Josh has committed a patch that I think should make it hard to have 
race conditions around the cache location data structure.

> Race condition in DAGScheduler
> --
>
> Key: SPARK-4454
> URL: https://issues.apache.org/jira/browse/SPARK-4454
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.1.0
>Reporter: Rafal Kwasny
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> It seems to be a race condition in DAGScheduler that manifests on jobs with 
> high concurrency:
> {noformat}
>  Exception in thread "main" java.util.NoSuchElementException: key not found: 
> 35
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:58)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
> at 
> org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInte

[jira] [Created] (SPARK-5879) spark_ec2.py should expose/return master and slave lists (e.g. write to file)

2015-02-17 Thread Florian Verhein (JIRA)
Florian Verhein created SPARK-5879:
--

 Summary: spark_ec2.py should expose/return master and slave lists 
(e.g. write to file)
 Key: SPARK-5879
 URL: https://issues.apache.org/jira/browse/SPARK-5879
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Florian Verhein



After running spark_ec2.py, it is often useful or necessary to know the master's 
IP address / DNS name, particularly when running spark_ec2.py is part of a larger pipeline.

For example, consider a wrapper that launches a cluster, then waits for 
completion of some application running on it (e.g. polling via ssh), before 
destroying the cluster.

Some options: 
- write `launch-variables.sh` with MASTERS and SLAVES exports (i.e. basically a 
subset of the ec2_variables.sh that is temporarily created as part of 
deploy_files variable substitution)
- launch-variables.json (same info but as json) 

Both would be useful depending on the wrapper language. 

I think we should incorporate the cluster name, for the case that multiple 
clusters are launched, e.g. <cluster-name>_variables.sh/.json

Thoughts?
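A minimal sketch of the two proposed outputs, assuming the launcher already has the master and slave hostname lists in hand (write_launch_variables is a hypothetical helper, not an existing spark_ec2.py function):

{code}
import json

def write_launch_variables(cluster_name, masters, slaves):
    # Hypothetical helper: emit both proposed formats, prefixed with the cluster name.
    prefix = "%s_variables" % cluster_name
    with open(prefix + ".sh", "w") as f:
        f.write('export MASTERS="%s"\n' % " ".join(masters))
        f.write('export SLAVES="%s"\n' % " ".join(slaves))
    with open(prefix + ".json", "w") as f:
        json.dump({"masters": masters, "slaves": slaves}, f, indent=2)
{code}

A shell wrapper could then simply source <cluster-name>_variables.sh, while a Python or other wrapper could read the JSON file.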




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5811) Documentation for --packages and --repositories on Spark Shell

2015-02-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5811.

Resolution: Fixed
  Assignee: Burak Yavuz

> Documentation for --packages and --repositories on Spark Shell
> --
>
> Key: SPARK-5811
> URL: https://issues.apache.org/jira/browse/SPARK-5811
> Project: Spark
>  Issue Type: Documentation
>  Components: Deploy, Spark Shell
>Affects Versions: 1.3.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Critical
> Fix For: 1.3.0
>
>
> Documentation for the new support for dependency management using maven 
> coordinates using --packages and --repositories



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5785) Pyspark does not support narrow dependencies

2015-02-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5785:
--
Assignee: Davies Liu

> Pyspark does not support narrow dependencies
> 
>
> Key: SPARK-5785
> URL: https://issues.apache.org/jira/browse/SPARK-5785
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Imran Rashid
>Assignee: Davies Liu
> Fix For: 1.3.0
>
>
> joins (& cogroups etc.) are always considered to have "wide" dependencies in 
> pyspark; they are never narrow. This can cause unnecessary shuffles. E.g., 
> this simple job should shuffle rddA & rddB once each, but it will also do a 
> third shuffle of the unioned data:
> {code}
> rddA = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64)
> rddB = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64)
> joined = rddA.join(rddB)
> joined.count()
> >>> rddA._partitionFunc == rddB._partitionFunc
> True
> {code}
> (Or the docs should somewhere explain that this feature is missing from 
> pyspark.)
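One possible interim workaround, not taken from this ticket: when one side of the join is small enough to collect, a broadcast (map-side) join avoids the join shuffle entirely. A rough PySpark sketch, reusing rddA and rddB from the example above:

{code}
# Assumes rddB is small enough to fit in driver and executor memory.
small = dict(rddB.collect())
bsmall = sc.broadcast(small)

joined = rddA.flatMap(
    lambda kv: [(kv[0], (kv[1], bsmall.value[kv[0]]))]
    if kv[0] in bsmall.value else [])
{code}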



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5785) Pyspark does not support narrow dependencies

2015-02-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5785.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4629
[https://github.com/apache/spark/pull/4629]

> Pyspark does not support narrow dependencies
> 
>
> Key: SPARK-5785
> URL: https://issues.apache.org/jira/browse/SPARK-5785
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Imran Rashid
> Fix For: 1.3.0
>
>
> joins (& cogroups etc.) are always considered to have "wide" dependencies in 
> pyspark; they are never narrow. This can cause unnecessary shuffles. E.g., 
> this simple job should shuffle rddA & rddB once each, but it will also do a 
> third shuffle of the unioned data:
> {code}
> rddA = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64)
> rddB = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64)
> joined = rddA.join(rddB)
> joined.count()
> >>> rddA._partitionFunc == rddB._partitionFunc
> True
> {code}
> (Or the docs should somewhere explain that this feature is missing from 
> pyspark.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5878) Python DataFrame.repartition() is broken

2015-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325220#comment-14325220
 ] 

Apache Spark commented on SPARK-5878:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4667

> Python DataFrame.repartition() is broken
> 
>
> Key: SPARK-5878
> URL: https://issues.apache.org/jira/browse/SPARK-5878
> Project: Spark
>  Issue Type: Bug
>Reporter: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5878) Python DataFrame.repartition() is broken

2015-02-17 Thread Davies Liu (JIRA)
Davies Liu created SPARK-5878:
-

 Summary: Python DataFrame.repartition() is broken
 Key: SPARK-5878
 URL: https://issues.apache.org/jira/browse/SPARK-5878
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark

2015-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325209#comment-14325209
 ] 

Apache Spark commented on SPARK-5722:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4666

> Infer_schema_type incorrect for Integers in pyspark
> ---
>
> Key: SPARK-5722
> URL: https://issues.apache.org/jira/browse/SPARK-5722
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
>Reporter: Don Drake
>
> The integer datatype in Python does not match what a Scala/Java integer is 
> defined as. This causes inference of data types and schemas to fail when a 
> value is larger than 2^31 - 1 and is inferred incorrectly as an Integer.
> Since the range of valid Python integers is wider than that of Java Integers, this 
> causes problems when inferring Integer vs. Long datatypes. This will cause 
> problems when attempting to save a SchemaRDD as Parquet or JSON.
> Here's an example:
> {code}
> >>> sqlCtx = SQLContext(sc)
> >>> from pyspark.sql import Row
> >>> rdd = sc.parallelize([Row(f1='a', f2=100)])
> >>> srdd = sqlCtx.inferSchema(rdd)
> >>> srdd.schema()
> StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
> {code}
> That number is a LongType in Java, but an Integer in Python. We need to 
> check the value to see if it should really be a LongType when an IntegerType 
> is initially inferred.
> More tests:
> {code}
> >>> from pyspark.sql import _infer_type
> # OK
> >>> print _infer_type(1)
> IntegerType
> # OK
> >>> print _infer_type(2**31-1)
> IntegerType
> # WRONG
> >>> print _infer_type(2**31)
> IntegerType
> # WRONG
> >>> print _infer_type(2**61)
> IntegerType
> # OK
> >>> print _infer_type(2**71)
> LongType
> {code}
> Java Primitive Types defined:
> http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
> Python Built-in Types:
> https://docs.python.org/2/library/stdtypes.html#typesnumeric
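A minimal Python 2 sketch of the range check the description asks for; plain strings stand in for the pyspark.sql type objects, and _infer_numeric_type is a hypothetical helper, not the existing _infer_type:

{code}
INT_MIN, INT_MAX = -(2 ** 31), 2 ** 31 - 1

def _infer_numeric_type(value):
    # Values outside the signed 32-bit range must become LongType,
    # even though Python happily keeps them as plain ints.
    if isinstance(value, bool):
        return "BooleanType"
    if isinstance(value, (int, long)) and INT_MIN <= value <= INT_MAX:
        return "IntegerType"
    return "LongType"

assert _infer_numeric_type(2 ** 31 - 1) == "IntegerType"
assert _infer_numeric_type(2 ** 31) == "LongType"
assert _infer_numeric_type(2 ** 61) == "LongType"
{code}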



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4579) Scheduling Delay appears negative

2015-02-17 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325202#comment-14325202
 ] 

Kay Ousterhout commented on SPARK-4579:
---

This only happens for running tasks; I'm guessing it's because we compute the scheduler 
delay as something like finish time - start time - other accounted components, and the 
finish time is 0 while a task is still running.
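As an illustration only (the field names are hypothetical and this is not the actual Spark UI code), the fix amounts to falling back to the current time while a running task's finish time is still unset, and clamping the result at zero:

{code}
import time

def scheduler_delay_ms(task):
    # task is a hypothetical dict of task metrics, all values in milliseconds.
    now_ms = int(time.time() * 1000)
    finish = task["finish_time"] if task["finish_time"] > 0 else now_ms
    total = finish - task["launch_time"]
    accounted = (task["executor_run_time"]
                 + task["result_serialization_time"]
                 + task["executor_deserialize_time"])
    # Never report a negative delay, even for tasks that are still running.
    return max(0, total - accounted)
{code}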

> Scheduling Delay appears negative
> -
>
> Key: SPARK-4579
> URL: https://issues.apache.org/jira/browse/SPARK-4579
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0
>Reporter: Arun Ahuja
>Priority: Minor
>
> !https://cloud.githubusercontent.com/assets/455755/5174438/23d08604-73ff-11e4-9a76-97233b610544.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5877) Scheduler delay is incorrect for running tasks

2015-02-17 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-5877:
--
Priority: Minor  (was: Major)

> Scheduler delay is incorrect for running tasks
> --
>
> Key: SPARK-5877
> URL: https://issues.apache.org/jira/browse/SPARK-5877
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Kay Ousterhout
>Priority: Minor
>  Labels: starter
>
> For running tasks, the scheduler delay is shown as a negative (huge) value in 
> the UI. I'm guessing this is because we compute the scheduler delay as 
> something like finish time - start time - other accounted components, and the 
> finish time is 0 while a task is still running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5877) Scheduler delay is incorrect for running tasks

2015-02-17 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout closed SPARK-5877.
-
Resolution: Duplicate

> Scheduler delay is incorrect for running tasks
> --
>
> Key: SPARK-5877
> URL: https://issues.apache.org/jira/browse/SPARK-5877
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Kay Ousterhout
>Priority: Minor
>  Labels: starter
>
> For running tasks, the scheduler delay is shown as a negative (huge) value in 
> the UI. I'm guessing this is because we compute the scheduler delay as 
> something like finish time - start time - other accounted components, and the 
> finish time is 0 while a task is still running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5877) Scheduler delay is incorrect for running tasks

2015-02-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325193#comment-14325193
 ] 

Sean Owen commented on SPARK-5877:
--

Same as SPARK-4579?

> Scheduler delay is incorrect for running tasks
> --
>
> Key: SPARK-5877
> URL: https://issues.apache.org/jira/browse/SPARK-5877
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Kay Ousterhout
>  Labels: starter
>
> For running tasks, the scheduler delay is shown as a negative (huge) value in 
> the UI. I'm guessing this is because we compute the scheduler delay as 
> something like finish time - start time - other accounted components, and the 
> finish time is 0 while a task is still running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3674) Add support for launching YARN clusters in spark-ec2

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3674:
-
Issue Type: Improvement  (was: Bug)

> Add support for launching YARN clusters in spark-ec2
> 
>
> Key: SPARK-3674
> URL: https://issues.apache.org/jira/browse/SPARK-3674
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>
> Right now spark-ec2 only supports launching Spark Standalone clusters. While 
> this is sufficient for basic usage, it is hard to test features or do 
> performance benchmarking on YARN. It would be good to add support for 
> installing and configuring an Apache YARN cluster at a fixed version -- say the 
> latest stable version, 2.4.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5877) Scheduler delay is incorrect for running tasks

2015-02-17 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-5877:
-

 Summary: Scheduler delay is incorrect for running tasks
 Key: SPARK-5877
 URL: https://issues.apache.org/jira/browse/SPARK-5877
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.2.1, 1.3.0
Reporter: Kay Ousterhout


For running tasks, the scheduler delay is shown as a negative (huge) value in the 
UI. I'm guessing this is because we compute the scheduler delay as something 
like finish time - start time - other accounted components, and the finish time 
is 0 while a task is still running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4299) In spark-submit, the driver-memory value is used for the SPARK_SUBMIT_DRIVER_MEMORY value

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4299.
--
Resolution: Duplicate

> In spark-submit, the driver-memory value is used for the 
> SPARK_SUBMIT_DRIVER_MEMORY value
> -
>
> Key: SPARK-4299
> URL: https://issues.apache.org/jira/browse/SPARK-4299
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Virgile Devaux
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> In the spark-submit script, the lines below:
> elif [ "$1" = "--driver-memory" ]; then
> export SPARK_SUBMIT_DRIVER_MEMORY=$2
> are wrong: spark-submit is not the process that will handle the driver when 
> you're in yarn-cluster mode. So, when I launch spark-submit on a light server 
> with only 2 GB of memory and want to allocate 4 GB of memory to the driver 
> (which will run in the resource manager on a big fat YARN server with, say, 
> 64 GB of RAM), spark-submit fails with an OutOfMemory error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-4299) In spark-submit, the driver-memory value is used for the SPARK_SUBMIT_DRIVER_MEMORY value

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-4299:
--

Wait a sec. I am not clear that this is resolved based on testing with 1.3.0, but 
it is a duplicate of SPARK-3884.

> In spark-submit, the driver-memory value is used for the 
> SPARK_SUBMIT_DRIVER_MEMORY value
> -
>
> Key: SPARK-4299
> URL: https://issues.apache.org/jira/browse/SPARK-4299
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Virgile Devaux
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> In the spark-submit script, the lines below:
> elif [ "$1" = "--driver-memory" ]; then
> export SPARK_SUBMIT_DRIVER_MEMORY=$2
> are wrong: spark-submit is not the process that will handle the driver when 
> you're in yarn-cluster mode. So, when I launch spark-submit on a light server 
> with only 2 GB of memory and want to allocate 4 GB of memory to the driver 
> (which will run in the resource manager on a big fat YARN server with, say, 
> 64 GB of RAM), spark-submit fails with an OutOfMemory error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4299) In spark-submit, the driver-memory value is used for the SPARK_SUBMIT_DRIVER_MEMORY value

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4299.
--
Resolution: Not a Problem

This may have been fixed along the way, but from examining related issues 
recently (like https://issues.apache.org/jira/browse/SPARK-5861) I know that 
in yarn-cluster mode spark-submit does not set the driver process's JVM heap size, 
since it is not itself the driver.

> In spark-submit, the driver-memory value is used for the 
> SPARK_SUBMIT_DRIVER_MEMORY value
> -
>
> Key: SPARK-4299
> URL: https://issues.apache.org/jira/browse/SPARK-4299
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Virgile Devaux
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> In the spark-submit script, the lines below:
> elif [ "$1" = "--driver-memory" ]; then
> export SPARK_SUBMIT_DRIVER_MEMORY=$2
> are wrong: spark-submit is not the process that will handle the driver when 
> you're in yarn-cluster mode. So, when I launch spark-submit on a light server 
> with only 2 GB of memory and want to allocate 4 GB of memory to the driver 
> (which will run in the resource manager on a big fat YARN server with, say, 
> 64 GB of RAM), spark-submit fails with an OutOfMemory error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4368) Ceph integration?

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4368:
-
Issue Type: Improvement  (was: Bug)

I don't think this ever evolved into a proposal to change Spark, and 
non-essential integration is generally directed to hosting outside the project 
now.

> Ceph integration?
> -
>
> Key: SPARK-4368
> URL: https://issues.apache.org/jira/browse/SPARK-4368
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Reporter: Serge Smertin
>
> There is a use case of storing a large number of relatively small BLOB objects 
> (2-20 MB), which requires some ugly workarounds in HDFS environments. There 
> is a need to process those BLOBs close to the data themselves, which is why the 
> MapReduce paradigm is a good fit, as it guarantees data locality.
> Ceph seems to be one of the systems that maintains both of these properties 
> (small files and data locality) - 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/032119.html. I 
> already know that Spark supports GlusterFS - 
> http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccf657f2b.5b3a1%25ven...@yarcdata.com%3E
> So I wonder: could there be an integration with this storage solution, and 
> what would the effort of doing that be?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4368) Ceph integration?

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4368.
--
Resolution: Won't Fix

> Ceph integration?
> -
>
> Key: SPARK-4368
> URL: https://issues.apache.org/jira/browse/SPARK-4368
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Reporter: Serge Smertin
>
> There is a use case of storing a large number of relatively small BLOB objects 
> (2-20 MB), which requires some ugly workarounds in HDFS environments. There 
> is a need to process those BLOBs close to the data themselves, which is why the 
> MapReduce paradigm is a good fit, as it guarantees data locality.
> Ceph seems to be one of the systems that maintains both of these properties 
> (small files and data locality) - 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/032119.html. I 
> already know that Spark supports GlusterFS - 
> http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccf657f2b.5b3a1%25ven...@yarcdata.com%3E
> So I wonder: could there be an integration with this storage solution, and 
> what would the effort of doing that be?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5570) No docs stating that `new SparkConf().set("spark.driver.memory", ...) will not work

2015-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325155#comment-14325155
 ] 

Apache Spark commented on SPARK-5570:
-

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/4665

> No docs stating that `new SparkConf().set("spark.driver.memory", ...) will 
> not work
> ---
>
> Key: SPARK-5570
> URL: https://issues.apache.org/jira/browse/SPARK-5570
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 1.2.0
>Reporter: Tathagata Das
>Assignee: Andrew Or
>
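A short PySpark illustration of the pitfall the docs should describe (the values are examples): in client mode the driver JVM is already running by the time user code builds the SparkConf, so the programmatic setting below has no effect.

{code}
from pyspark import SparkConf, SparkContext

# Too late: the driver JVM has already started with its default heap size.
conf = SparkConf().set("spark.driver.memory", "4g")
sc = SparkContext(conf=conf)

# Works: set it before the driver JVM starts, e.g.
#   spark-submit --driver-memory 4g my_app.py
# or put "spark.driver.memory  4g" in conf/spark-defaults.conf
{code}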




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4449) specify port range in spark

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4449:
-
Priority: Minor  (was: Major)
Target Version/s:   (was: 1.2.0)
  Issue Type: Improvement  (was: Bug)

> specify port range in spark
> ---
>
> Key: SPARK-4449
> URL: https://issues.apache.org/jira/browse/SPARK-4449
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Fei Wang
>Priority: Minor
>
>  In some cases, we need to specify the port range used in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4796) Spark does not remove temp files

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4796.
--
Resolution: Duplicate

These are the same issue; I'm not sure which way to resolve this. I am not 
convinced that temp files that shouldn't be there at all are being left behind; 
that doesn't mean there's no way to reduce the number of temp files. The other 
issue at least pertains to highlighting this behaviour.

> Spark does not remove temp files
> 
>
> Key: SPARK-4796
> URL: https://issues.apache.org/jira/browse/SPARK-4796
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.1.0
> Environment: I'm running Spark on Mesos and the Mesos slaves are Docker 
> containers. Spark 1.1.0, elasticsearch-spark 2.1.0-Beta3, Mesos 0.20.0, 
> Docker 1.2.0.
>Reporter: Ian Babrou
>
> I started a job that cannot fit into memory and got "no space left on 
> device". That was fair, because the Docker containers only have 10 GB of disk 
> space and some is already taken by the OS.
> But then I found out that when the job failed it didn't release any disk space, 
> leaving the container without any free disk space.
> Then I decided to check whether Spark removes temp files at all, because many 
> Mesos slaves had /tmp/spark-local-* directories. Apparently some garbage stays 
> after a Spark task is finished. I attached strace to a running job:
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/12/temp_8a73fcc2-4baa-499a-8add-0161f918de8a")
>  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/31/temp_47efd04b-d427-4139-8f48-3d5d421e9be4")
>  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/15/temp_619a46dc-40de-43f1-a844-4db146a607c6")
>  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/05/temp_d97d90a7-8bc1-4742-ba9b-41d74ea73c36"
>  
> [pid 30212] <... unlink resumed> )  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/36/temp_a2deb806-714a-457a-90c8-5d9f3247a5d7")
>  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/04/temp_afd558f1-2fd0-48d7-bc65-07b5f4455b22")
>  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/32/temp_a7add910-8dc3-482c-baf5-09d5a187c62a"
>  
> [pid 30212] <... unlink resumed> )  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/21/temp_485612f0-527f-47b0-bb8b-6016f3b9ec19")
>  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/12/temp_bb2b4e06-a9dd-408e-8395-f6c5f4e2d52f")
>  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/1e/temp_825293c6-9d3b-4451-9cb8-91e2abe5a19d"
>  
> [pid 30212] <... unlink resumed> )  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/15/temp_43fbb94c-9163-4aa7-ab83-e7693b9f21fc")
>  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/3d/temp_37f3629c-1b09-4907-b599-61b7df94b898"
>  
> [pid 30212] <... unlink resumed> )  = 0
> [pid 30212] 
> unlink("/tmp/spark-local-20141209091330-48b5/35/temp_d18f49f6-1fb1-4c01-a694-0ee0a72294c0")
>  = 0
> And after the job is finished, some files are still there:
> /tmp/spark-local-20141209091330-48b5/
> /tmp/spark-local-20141209091330-48b5/11
> /tmp/spark-local-20141209091330-48b5/11/shuffle_0_1_4
> /tmp/spark-local-20141209091330-48b5/32
> /tmp/spark-local-20141209091330-48b5/04
> /tmp/spark-local-20141209091330-48b5/05
> /tmp/spark-local-20141209091330-48b5/0f
> /tmp/spark-local-20141209091330-48b5/0f/shuffle_0_1_2
> /tmp/spark-local-20141209091330-48b5/3d
> /tmp/spark-local-20141209091330-48b5/0e
> /tmp/spark-local-20141209091330-48b5/0e/shuffle_0_1_1
> /tmp/spark-local-20141209091330-48b5/15
> /tmp/spark-local-20141209091330-48b5/0d
> /tmp/spark-local-20141209091330-48b5/0d/shuffle_0_1_0
> /tmp/spark-local-20141209091330-48b5/36
> /tmp/spark-local-20141209091330-48b5/31
> /tmp/spark-local-20141209091330-48b5/12
> /tmp/spark-local-20141209091330-48b5/21
> /tmp/spark-local-20141209091330-48b5/10
> /tmp/spark-local-20141209091330-48b5/10/shuffle_0_1_3
> /tmp/spark-local-20141209091330-48b5/1e
> /tmp/spark-local-20141209091330-48b5/35
> If I look into my mesos slaves, there are mostly "shuffle" files, overall 
> picture for single node:
> root@web338:~# find /tmp/spark-local-20141* -type f | fgrep shuffle | wc -l
> 781
> root@web338:~# find /tmp/spark-local-20141* -type f | fgrep -v shuffle | wc -l
> 10
> root@web338:~# find /tmp/spark-local-20141* -type f | fgrep -v shuffle
> /tmp/spark-local-20141119144512-67c4/2d/temp_9056f380-3edb-48d6-a7df-d4896f1e1cc3
> /tmp/spark-local-20141119144512-67c4/3d/temp_e005659b-eddf-4a34-947f-4f63fcddf111
> /tmp/spark-local-20141119144512-67c4/16/temp_71eba702-36b4-4e1a-aebc-20d2080f1705
> /tmp/spark-local-20141119144512-67c4/0d/temp_8037b9db-2d8a-4786-a554-a8cad922bf5e
> /tmp/spark-local-20141119144512-67c4/24/temp_f0e4c
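Not a Spark API, just a hedged operational mitigation while the cleanup behaviour itself is investigated: point spark.local.dir at a dedicated volume and/or periodically purge stale spark-local-* directories, along the lines of:

{code}
import glob
import os
import shutil
import time

def purge_stale_spark_local(root="/tmp", max_age_hours=24):
    # Remove leftover spark-local-* directories that have not been touched for
    # max_age_hours. Only run this on hosts where no executor could still be
    # using data that old.
    cutoff = time.time() - max_age_hours * 3600
    for d in glob.glob(os.path.join(root, "spark-local-*")):
        if os.path.isdir(d) and os.path.getmtime(d) < cutoff:
            shutil.rmtree(d, ignore_errors=True)
{code}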

[jira] [Commented] (SPARK-5507) Add user guide for block matrix and its operations

2015-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325145#comment-14325145
 ] 

Apache Spark commented on SPARK-5507:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/4664

> Add user guide for block matrix and its operations
> --
>
> Key: SPARK-5507
> URL: https://issues.apache.org/jira/browse/SPARK-5507
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>
> User guide should cover converters from/to block matrices and linear algebra 
> operations we support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5852) Fail to convert a newly created empty metastore parquet table to a data source parquet table.

2015-02-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5852.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4655
[https://github.com/apache/spark/pull/4655]

> Fail to convert a newly created empty metastore parquet table to a data 
> source parquet table.
> -
>
> Key: SPARK-5852
> URL: https://issues.apache.org/jira/browse/SPARK-5852
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
> Fix For: 1.3.0
>
>
> To reproduce the exception, try
> {code}
> val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
> sqlContext.jsonRDD(rdd).registerTempTable("jt")
> sqlContext.sql("create table test stored as parquet as select * from jt")
> {code}
> ParquetConversions tries to convert the write path to the data source API 
> write path. But, the following exception was thrown.
> {code}
> java.lang.UnsupportedOperationException: empty.reduceLeft
>   at 
> scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:167)
>   at 
> scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:47)
>   at 
> scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
>   at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:195)
>   at scala.collection.AbstractTraversable.reduce(Traversable.scala:105)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$.readSchema(newParquet.scala:633)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.org$apache$spark$sql$parquet$ParquetRelation2$MetadataCache$$readSchema(newParquet.scala:349)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$8.apply(newParquet.scala:290)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$8.apply(newParquet.scala:290)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:290)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2.(newParquet.scala:354)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToParquetRelation(HiveMetastoreCatalog.scala:218)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$$anonfun$apply$4.apply(HiveMetastoreCatalog.scala:440)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$$anonfun$apply$4.apply(HiveMetastoreCatalog.scala:439)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
>   at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$.apply(HiveMetastoreCatalog.scala:439)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$.apply(HiveMetastoreCatalog.scala:416)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:917)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:917)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:918)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:918)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:919)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:919)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycom

[jira] [Resolved] (SPARK-5872) pyspark shell should start up with SQL/HiveContext

2015-02-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5872.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4659
[https://github.com/apache/spark/pull/4659]

> pyspark shell should start up with SQL/HiveContext
> --
>
> Key: SPARK-5872
> URL: https://issues.apache.org/jira/browse/SPARK-5872
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5821) JSONRelation should check if delete is successful for the overwrite operation.

2015-02-17 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-5821:

Priority: Major  (was: Blocker)

> JSONRelation should check if delete is successful for the overwrite operation.
> --
>
> Key: SPARK-5821
> URL: https://issues.apache.org/jira/browse/SPARK-5821
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Yanbo Liang
>
> When you run CTAS command such as
> "CREATE TEMPORARY TABLE jsonTable
> USING org.apache.spark.sql.json.DefaultSource
> OPTIONS (
> path /a/b/c/d
> ) AS
> SELECT a, b FROM jt",
> you will run into a failure if you don't have write permission for the directory 
> /a/b/c, whether d is a directory or a file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5875) logical.Project should not be resolved if it contains aggregates or generators

2015-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325121#comment-14325121
 ] 

Apache Spark commented on SPARK-5875:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/4663

> logical.Project should not be resolved if it contains aggregates or generators
> --
>
> Key: SPARK-5875
> URL: https://issues.apache.org/jira/browse/SPARK-5875
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> To reproduce...
> {code}
> val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
> sqlContext.jsonRDD(rdd).registerTempTable("jt")
> sqlContext.sql("CREATE TABLE gen_tmp (key Int)")
> sqlContext.sql("INSERT OVERWRITE TABLE gen_tmp SELECT explode(array(1,2,3)) 
> AS val FROM jt LIMIT 1")
> {code}
> The exception is
> {code}
> org.apache.spark.sql.AnalysisException: invalid cast from 
> array> to int;
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.failAnalysis(Analyzer.scala:85)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$18$$anonfun$apply$2.applyOrElse(Analyzer.scala:98)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$18$$anonfun$apply$2.applyOrElse(Analyzer.scala:92)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)
> {code}
> The cause of this exception is that PreInsertionCasts in HiveMetastoreCatalog 
> was triggered on an invalid query plan 
> {code}
> Project 
> [HiveGenericUdtf#org.apache.hadoop.hive.ql.udf.generic.GenericUDTFExplode(Array(1,2,3))
>  AS val#19]
>   Subquery jt
>LogicalRDD [a#0L,b#1], MapPartitionsRDD[4] at map at JsonRDD.scala:41
> {code}
> Then, after the transformation of PreInsertionCasts, ImplicitGenerate cannot 
> be applied.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5876) generalize the type of categoricalFeaturesInfo to PartialFunction[Int, Int]

2015-02-17 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-5876:
-

 Summary: generalize the type of categoricalFeaturesInfo to 
PartialFunction[Int, Int]
 Key: SPARK-5876
 URL: https://issues.apache.org/jira/browse/SPARK-5876
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Erik Erlandson
Priority: Minor


The decision tree training takes a parameter {{categoricalFeaturesInfo}} of 
type {{Map\[Int,Int\]}} that encodes which features are categorical and how many 
categorical values each of them takes.

It would be useful to generalize this type to its superclass 
{{PartialFunction\[Int,Int\]}}, which would be backward compatible with 
{{Map\[Int,Int\]}}, but can also accept a {{Seq\[Int\]}}, or any other partial 
function.

We would need to verify that any tests for key definition in the mapping use 
{{isDefinedAt(key)}} instead of {{contains(key)}}.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5875) logical.Project should not be resolved if it contains aggregates or generators

2015-02-17 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-5875:

Description: 
To reproduce...
{code}
val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
sqlContext.jsonRDD(rdd).registerTempTable("jt")
sqlContext.sql("CREATE TABLE gen_tmp (key Int)")
sqlContext.sql("INSERT OVERWRITE TABLE gen_tmp SELECT explode(array(1,2,3)) AS 
val FROM jt LIMIT 1")
{code}
The exception is
{code}
org.apache.spark.sql.AnalysisException: invalid cast from 
array> to int;
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.failAnalysis(Analyzer.scala:85)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$18$$anonfun$apply$2.applyOrElse(Analyzer.scala:98)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$18$$anonfun$apply$2.applyOrElse(Analyzer.scala:92)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)
{code}

The cause of this exception is that PreInsertionCasts in HiveMetastoreCatalog 
was triggered on an invalid query plan 
{code}
Project 
[HiveGenericUdtf#org.apache.hadoop.hive.ql.udf.generic.GenericUDTFExplode(Array(1,2,3))
 AS val#19]
  Subquery jt
   LogicalRDD [a#0L,b#1], MapPartitionsRDD[4] at map at JsonRDD.scala:41
{code}
Then, after the transformation of PreInsertionCasts, ImplicitGenerate cannot be 
applied.

  was:
{code}
val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
sqlContext.jsonRDD(rdd).registerTempTable("jt")
sqlContext.sql("CREATE TABLE gen_tmp (key Int)")
sqlContext.sql("INSERT OVERWRITE TABLE gen_tmp SELECT explode(array(1,2,3)) AS 
val FROM jt LIMIT 1")
{code}

{code}
org.apache.spark.sql.AnalysisException: invalid cast from 
array> to int;
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.failAnalysis(Analyzer.scala:85)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$18$$anonfun$apply$2.applyOrElse(Analyzer.scala:98)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$18$$anonfun$apply$2.applyOrElse(Analyzer.scala:92)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)
{code}


> logical.Project should not be resolved if it contains aggregates or generators
> --
>
> Key: SPARK-5875
> URL: https://issues.apache.org/jira/browse/SPARK-5875
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> To reproduce...
> {code}
> val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
> sqlContext.jsonRDD(rdd).registerTempTable("jt")
> sqlContext.sql("CREATE TABLE gen_tmp (key Int)")
> sqlContext.sql("INSERT OVERWRITE TABLE gen_tmp SELECT explode(array(1,2,3)) 
> AS val FROM jt LIMIT 1")
> {code}
> The exception is
> {code}
> org.apache.spark.sql.AnalysisException: invalid cast from 
> array> to int;
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.failAnalysis(Analyzer.scala:85)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$18$$anonfun$apply$2.applyOrElse(Analyzer.scala:98)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$18$$anonfun$apply$2.applyOrElse(Analyzer.scala:92)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)
> {code}
> The cause of this exception is that PreInsertionCasts in HiveMetastoreCatalog 
> was triggered on an invalid query plan 
> {code}
> Project

[jira] [Updated] (SPARK-5875) logical.Project should not be resolved if it contains aggregates or generators

2015-02-17 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-5875:

Description: 
{code}
val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
sqlContext.jsonRDD(rdd).registerTempTable("jt")
sqlContext.sql("CREATE TABLE gen_tmp (key Int)")
sqlContext.sql("INSERT OVERWRITE TABLE gen_tmp SELECT explode(array(1,2,3)) AS 
val FROM jt LIMIT 1")
{code}

{code}
org.apache.spark.sql.AnalysisException: invalid cast from 
array> to int;
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.failAnalysis(Analyzer.scala:85)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$18$$anonfun$apply$2.applyOrElse(Analyzer.scala:98)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$18$$anonfun$apply$2.applyOrElse(Analyzer.scala:92)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)
{code}

> logical.Project should not be resolved if it contains aggregates or generators
> --
>
> Key: SPARK-5875
> URL: https://issues.apache.org/jira/browse/SPARK-5875
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> {code}
> val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
> sqlContext.jsonRDD(rdd).registerTempTable("jt")
> sqlContext.sql("CREATE TABLE gen_tmp (key Int)")
> sqlContext.sql("INSERT OVERWRITE TABLE gen_tmp SELECT explode(array(1,2,3)) 
> AS val FROM jt LIMIT 1")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: invalid cast from 
> array> to int;
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.failAnalysis(Analyzer.scala:85)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$18$$anonfun$apply$2.applyOrElse(Analyzer.scala:98)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$18$$anonfun$apply$2.applyOrElse(Analyzer.scala:92)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5875) logical.Project should not be resolved if it contains aggregates or generators

2015-02-17 Thread Yin Huai (JIRA)
Yin Huai created SPARK-5875:
---

 Summary: logical.Project should not be resolved if it contains 
aggregates or generators
 Key: SPARK-5875
 URL: https://issues.apache.org/jira/browse/SPARK-5875
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Blocker






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5065) BroadCast can still work after sc had been stopped.

2015-02-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5065.
---
Resolution: Fixed
  Assignee: Josh Rosen

This was fixed as part of my PR for SPARK-5063

> BroadCast can still work after sc had been stopped.
> ---
>
> Key: SPARK-5065
> URL: https://issues.apache.org/jira/browse/SPARK-5065
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
>Reporter: SaintBacchus
>Assignee: Josh Rosen
>Priority: Minor
>
> Code as follows:
> {code:borderStyle=solid}
> val sc1 = new SparkContext
> val sc2 = new SparkContext
> sc1.stop
> sc1.broadcast(1)
> {code}
> It still works, because sc1.broadcast will reuse the BlockManager in sc2.
> To fix it, throw a SparkException when the BroadcastManager has been stopped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5607) NullPointerException in objenesis

2015-02-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5607.
---
Resolution: Fixed

I think that this has been fixed by my patch that removes EasyMock, so I'm 
going to mark this as Resolved.

> NullPointerException in objenesis
> -
>
> Key: SPARK-5607
> URL: https://issues.apache.org/jira/browse/SPARK-5607
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Reynold Xin
>Assignee: Patrick Wendell
> Fix For: 1.3.0
>
>
> Tests are sometimes failing with the following exception.
> The problem might be that Kryo is using a different version of objenesis from 
> Mockito.
> {code}
> [info] - Process succeeds instantly *** FAILED *** (107 milliseconds)
> [info]   java.lang.NullPointerException:
> [info]   at 
> org.objenesis.strategy.StdInstantiatorStrategy.newInstantiatorOf(StdInstantiatorStrategy.java:52)
> [info]   at 
> org.objenesis.ObjenesisBase.getInstantiatorOf(ObjenesisBase.java:90)
> [info]   at org.objenesis.ObjenesisBase.newInstance(ObjenesisBase.java:73)
> [info]   at 
> org.mockito.internal.creation.jmock.ClassImposterizer.createProxy(ClassImposterizer.java:111)
> [info]   at 
> org.mockito.internal.creation.jmock.ClassImposterizer.imposterise(ClassImposterizer.java:51)
> [info]   at org.mockito.internal.util.MockUtil.createMock(MockUtil.java:52)
> [info]   at org.mockito.internal.MockitoCore.mock(MockitoCore.java:41)
> [info]   at org.mockito.Mockito.mock(Mockito.java:1014)
> [info]   at org.mockito.Mockito.mock(Mockito.java:909)
> [info]   at 
> org.apache.spark.deploy.worker.DriverRunnerTest$$anonfun$1.apply$mcV$sp(DriverRunnerTest.scala:50)
> [info]   at 
> org.apache.spark.deploy.worker.DriverRunnerTest$$anonfun$1.apply(DriverRunnerTest.scala:47)
> [info]   at 
> org.apache.spark.deploy.worker.DriverRunnerTest$$anonfun$1.apply(DriverRunnerTest.scala:47)
> [info]   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
> [info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> [info]   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info]   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
> [info]   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
> [info]   at scala.collection.immutable.List.foreach(List.scala:318)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info]   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
> [info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
> [info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
> [info]   at org.scalatest.Suite$class.run(Suite.scala:1424)
> [info]   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
> [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
> [info]   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
> [info]   at org.scalatest.FunSuite.run(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
> [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWo

[jira] [Resolved] (SPARK-3637) NPE in ShuffleMapTask

2015-02-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3637.
---
Resolution: Cannot Reproduce

I'm going to resolve this as "Cannot Reproduce."  If you see this exception on 
a newer Spark version, please re-open or file a new issue.

> NPE in ShuffleMapTask
> -
>
> Key: SPARK-3637
> URL: https://issues.apache.org/jira/browse/SPARK-3637
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Przemyslaw Pastuszka
>
> When trying to execute spark.jobserver.WordCountExample using spark-jobserver 
> (https://github.com/ooyala/spark-jobserver) we observed that often it fails 
> with NullPointerException in ShuffleMapTask.scala. Here are full details:
> {code}
> Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 1.0 (TID 6, 
> hadoop-simple-768-worker-with-zookeeper-0): java.lang.NullPointerException: 
> \njava.nio.ByteBuffer.wrap(ByteBuffer.java:392)\n
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:61)\n  
>   
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)\n  
>   org.apache.spark.scheduler.Task.run(Task.scala:54)\n
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)\n   
>  
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n
> java.lang.Thread.run(Thread.java:745)\nDriver stacktrace:",
> "errorClass": "org.apache.spark.SparkException",
> "stack": 
> ["org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1153)",
>  
> "org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1142)",
>  
> "org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1141)",
>  
> "scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)",
>  "scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)", 
> "org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1141)",
>  
> "org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682)",
>  
> "org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682)",
>  "scala.Option.foreach(Option.scala:236)", 
> "org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:682)",
>  
> "org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1359)",
>  "akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)", 
> "akka.actor.ActorCell.invoke(ActorCell.scala:456)", 
> "akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)", 
> "akka.dispatch.Mailbox.run(Mailbox.scala:219)", 
> "akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)",
>  "scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)", 
> "scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)",
>  "scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)", 
> "scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)"
> {code}
> I am aware that this failure may be due to the job being ill-defined by 
> spark-jobserver (I don't know if that's the case), but if so, it should 
> be handled more gracefully on the Spark side.
> It is also important that this issue doesn't always happen, which may 
> indicate some type of race condition in the code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4603) EOF when broadcasting a dict with an empty string value.

2015-02-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4603.
---
Resolution: Cannot Reproduce

I'm going to resolve this as "Cannot Reproduce" for now.  Please re-open if you 
have more information.

> EOF when broadcasting a dict with an empty string value.
> 
>
> Key: SPARK-4603
> URL: https://issues.apache.org/jira/browse/SPARK-4603
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
> Environment: OSX 10.10
>Reporter: Alex Angelini
>
> Steps to reproduce:
> 1. Broadcast {'a': ''}
> 2. Try to read the value of the broadcast
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 1.3.0-SNAPSHOT
>   /_/
> Using Python version 2.7.8 (default, Oct 19 2014 16:02:00)
> SparkContext available as sc.
> In [1]: sc
> Out[1]: 
> In [2]: b = sc.broadcast({'a': ''})
> In [3]: b.value
> ---
> EOFError  Traceback (most recent call last)
>  in ()
> > 1 b.value
> /Users/alexangelini/src/starscream/spark/current/python/pyspark/broadcast.pyc 
> in value(self)
>  75 if not hasattr(self, "_value") and self.path is not None:
>  76 ser = LargeObjectSerializer()
> ---> 77 self._value = ser.load_stream(open(self.path)).next()
>  78 return self._value
>  79
> /Users/alexangelini/src/starscream/spark/current/python/pyspark/serializers.pyc
>  in load_stream(self, stream)
> 615 yield value
> 616 elif type == 'P':
> --> 617 yield cPickle.load(stream)
> 618 else:
> 619 raise ValueError("unknown type: %s" % type)
> EOFError:
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2202) saveAsTextFile hangs on final 2 tasks

2015-02-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2202.
---
Resolution: Cannot Reproduce

I'm going to resolve this as "Cannot Reproduce" since it's really old.  Please 
re-open or file a new issue if you're still observing this problem in newer 
Spark versions.

> saveAsTextFile hangs on final 2 tasks
> -
>
> Key: SPARK-2202
> URL: https://issues.apache.org/jira/browse/SPARK-2202
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: CentOS 5.7
> 16 nodes, 24 cores per node, 14g RAM per executor
>Reporter: Suren Hiraman
> Attachments: spark_trace.1.txt, spark_trace.2.txt
>
>
> I have a flow that takes in about 10 GB of data and writes out about 10 GB of 
> data.
> The final step is saveAsTextFile() to HDFS. This seems to hang on 2 remaining 
> tasks, always on the same node.
> It seems that the 2 tasks are waiting for data from a remote task/RDD 
> partition.
> After about 2 hours or so, the stuck tasks get a closed connection exception 
> and you can see the remote side logging that as well. Log lines are below.
> My custom settings are:
> conf.set("spark.executor.memory", "14g") // TODO make this 
> configurable
> 
> // shuffle configs
> conf.set("spark.default.parallelism", "320")
> conf.set("spark.shuffle.file.buffer.kb", "200")
> conf.set("spark.reducer.maxMbInFlight", "96")
> 
> conf.set("spark.rdd.compress","true")
> 
> conf.set("spark.worker.timeout","180")
> 
> // akka settings
> conf.set("spark.akka.threads", "300")
> conf.set("spark.akka.timeout", "180")
> conf.set("spark.akka.frameSize", "100")
> conf.set("spark.akka.batchSize", "30")
> conf.set("spark.akka.askTimeout", "30")
> 
> // block manager
> conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
> conf.set("spark.blockManagerHeartBeatMs", "8")
> "STUCK" WORKER
> 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from 
> connection to ConnectionManagerId(172.16.25.103,57626)
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcher.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
> at sun.nio.ch.IOUtil.read(IOUtil.java:224)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
> at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)
> REMOTE WORKER
> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing 
> ReceivingConnection to ConnectionManagerId(172.16.25.124,55610)
> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding 
> SendingConnectionManagerId not found



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-664) Accumulator updates should get locally merged before sent to the driver

2015-02-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-664:
-
Issue Type: Improvement  (was: Bug)

> Accumulator updates should get locally merged before sent to the driver
> ---
>
> Key: SPARK-664
> URL: https://issues.apache.org/jira/browse/SPARK-664
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Imran Rashid
>Priority: Minor
>
> Whenever a task finishes, the accumulator updates from that task are 
> immediately sent back to the driver.  When the accumulator updates are big, 
> this is inefficient because (a) a lot more data has to be sent to the driver 
> and (b) the driver has to do all the work of merging the updates together.
> Probably doesn't matter for small accumulators / low number of tasks, but if 
> both are big, this could be a big bottleneck.
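A conceptual sketch of the proposed improvement (illustrative types only; real 
accumulator updates are not simple Long counts, and this is not Spark's code):

{code}
// Merge the per-task updates held on one executor into a single map keyed by
// accumulator id, so only one combined update is shipped to the driver.
def mergeLocally(perTaskUpdates: Seq[Map[Long, Long]]): Map[Long, Long] =
  perTaskUpdates.foldLeft(Map.empty[Long, Long]) { (merged, taskUpdate) =>
    taskUpdate.foldLeft(merged) { case (acc, (id, delta)) =>
      acc.updated(id, acc.getOrElse(id, 0L) + delta)
    }
  }
{code}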



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5851) spark_ec2.py ssh failure retry handling not always appropriate

2015-02-17 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324986#comment-14324986
 ] 

Florian Verhein commented on SPARK-5851:


That makes sense.

Yeah, I ran into it yesterday. My spark-ec2/setup.sh failed (it had set -u enabled in 
a new component I was testing), which resulted in looping over setup.sh calls. 
In this case, spark_ec2.py shouldn't retry, but should fail gracefully (ideally after 
cleaning up the cluster and returning a failure code).

> spark_ec2.py ssh failure retry handling not always appropriate
> --
>
> Key: SPARK-5851
> URL: https://issues.apache.org/jira/browse/SPARK-5851
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Reporter: Florian Verhein
>Priority: Minor
>
> The following function doesn't distinguish between the ssh failing (e.g. 
> presumably a connection issue) and the remote command that it executes 
> failing (e.g. setup.sh). The latter should probably not result in a retry. 
> Perhaps tries could be an argument that is set to 1 for certain usages. 
> # Run a command on a host through ssh, retrying up to five times
> # and then throwing an exception if ssh continues to fail.
> spark-ec2: [{{def ssh(host, opts, 
> command)}}|https://github.com/apache/spark/blob/d8f69cf78862d13a48392a0b94388b8d403523da/ec2/spark_ec2.py#L953-L975]
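A sketch of the distinction being asked for, written in Scala purely for 
illustration (spark_ec2.py itself is Python): ssh exits with 255 when the 
connection itself fails and with the remote command's own exit code otherwise, 
so only the 255 case is worth retrying.

{code}
import scala.sys.process._

// Retry only connection-level ssh failures; surface remote command failures
// (e.g. a broken setup.sh) immediately instead of looping.
def sshWithRetry(host: String, command: String, tries: Int = 5): Unit = {
  val exitCode = Seq("ssh", host, command).!
  if (exitCode == 0) ()
  else if (exitCode == 255 && tries > 1) sshWithRetry(host, command, tries - 1)
  else sys.error(s"ssh to $host failed with exit code $exitCode; not retrying")
}
{code}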



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)

2015-02-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3644:
--
Issue Type: New Feature  (was: Bug)

> REST API for Spark application info (jobs / stages / tasks / storage info)
> --
>
> Key: SPARK-3644
> URL: https://issues.apache.org/jira/browse/SPARK-3644
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Reporter: Josh Rosen
>
> This JIRA is a forum to draft a design proposal for a REST interface for 
> accessing information about Spark applications, such as job / stage / task / 
> storage status.
> There have been a number of proposals to serve JSON representations of the 
> information displayed in Spark's web UI.  Given that we might redesign the 
> pages of the web UI (and possibly re-implement the UI as a client of a REST 
> API), the API endpoints and their responses should be independent of what we 
> choose to display on particular web UI pages / layouts.
> Let's start a discussion of what a good REST API would look like from 
> first-principles.  We can discuss what urls / endpoints expose access to 
> data, how our JSON responses will be formatted, how fields will be named, how 
> the API will be documented and tested, etc.
> Some links for inspiration:
> https://developer.github.com/v3/
> http://developer.netflix.com/docs/REST_API_Reference
> https://helloreverb.com/developers/swagger



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5871) Explain in python should output using python

2015-02-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5871.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4658
[https://github.com/apache/spark/pull/4658]

> Explain in python should output using python
> 
>
> Key: SPARK-5871
> URL: https://issues.apache.org/jira/browse/SPARK-5871
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.3.0
>
>
> Instead of relying on the println in scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4571) History server shows negative time

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4571:
-
Priority: Minor  (was: Major)

> History server shows negative time
> --
>
> Key: SPARK-4571
> URL: https://issues.apache.org/jira/browse/SPARK-4571
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Priority: Minor
> Attachments: Screen Shot 2014-11-21 at 2.49.25 PM.png
>
>
> See attachment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4738) Update the netty-3.x version in spark-assembly-*.jar

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4738:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> Update the netty-3.x version in spark-assembly-*.jar
> 
>
> Key: SPARK-4738
> URL: https://issues.apache.org/jira/browse/SPARK-4738
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.1.0
>Reporter: Tobias Pfeiffer
>Priority: Minor
>
> It seems as if the version of akka-remote (2.2.3-shaded-protobuf) that is 
> bundled in the spark-assembly-1.1.1-hadoop2.4.0.jar file pulls in an ancient 
> version of netty, namely io.netty:netty:3.6.6.Final (using the package 
> org.jboss.netty). This means that when using spark-submit, there will always 
> be this netty version on the classpath before any versions added by the user. 
> This may lead to issues with other packages that depend on newer versions and 
> may fail with java.lang.NoSuchMethodError etc. (finagle-http in my case).
> I wonder if it is possible to manually include a newer netty version, like 
> netty-3.8.0.Final.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4863) Suspicious exception handlers

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4863:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

Reporting these is interesting, but what's needed is to investigate whether 
they imply a change is necessary, and open a PR for those that do. Can you take 
that step too?

> Suspicious exception handlers
> -
>
> Key: SPARK-4863
> URL: https://issues.apache.org/jira/browse/SPARK-4863
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.1
>Reporter: Ding Yuan
>Priority: Minor
>
> Following up with the discussion in 
> https://issues.apache.org/jira/browse/SPARK-1148, I am creating a new JIRA to 
> report the suspicious exception handlers detected by our tool aspirator on 
> spark-1.1.1. 
> {noformat}
> ==
> WARNING: TODO;  in handler.
>   Line: 129, File: "org/apache/thrift/transport/TNonblockingServerSocket.java"
> 122:  public void registerSelector(Selector selector) {
> 123:try {
> 124:  // Register the server socket channel, indicating an interest in
> 125:  // accepting new connections
> 126:  serverSocketChannel.register(selector, SelectionKey.OP_ACCEPT);
> 127:} catch (ClosedChannelException e) {
> 128:  // this shouldn't happen, ideally...
> 129:  // TODO: decide what to do with this.
> 130:}
> 131:  }
> ==
> ==
> WARNING: TODO;  in handler.
>   Line: 1583, File: "org/apache/spark/SparkContext.scala"
> 1578: val scheduler = try {
> 1579:   val clazz = 
> Class.forName("org.apache.spark.scheduler.cluster.YarnClusterScheduler")
> 1580:   val cons = clazz.getConstructor(classOf[SparkContext])
> 1581:   cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
> 1582: } catch {
> 1583:   // TODO: Enumerate the exact reasons why it can fail
> 1584:   // But irrespective of it, it means we cannot proceed !
> 1585:   case e: Exception => {
> 1586: throw new SparkException("YARN mode not available ?", e)
> 1587:   }
> ==
> ==
> WARNING 1: empty handler for exception: java.lang.Exception
> THERE IS NO LOG MESSAGE!!!
>   Line: 75, File: "org/apache/spark/repl/ExecutorClassLoader.scala"
> try {
>   val pathInDirectory = name.replace('.', '/') + ".class"
>   val inputStream = {
> if (fileSystem != null) {
>   fileSystem.open(new Path(directory, pathInDirectory))
> } else {
>   if (SparkEnv.get.securityManager.isAuthenticationEnabled()) {
> val uri = new URI(classUri + "/" + urlEncode(pathInDirectory))
> val newuri = Utils.constructURIForAuthentication(uri, 
> SparkEnv.get.securityManager)
> newuri.toURL().openStream()
>   } else {
> new URL(classUri + "/" + urlEncode(pathInDirectory)).openStream()
>   }
> }
>   }
>   val bytes = readAndTransformClass(name, inputStream)
>   inputStream.close()
>   Some(defineClass(name, bytes, 0, bytes.length))
> } catch {
>   case e: Exception => None
> }
> ==
> ==
> WARNING 1: empty handler for exception: java.io.IOException
> THERE IS NO LOG MESSAGE!!!
>   Line: 275, File: "org/apache/spark/util/Utils.scala"
>   try {
> dir = new File(root, "spark-" + UUID.randomUUID.toString)
> if (dir.exists() || !dir.mkdirs()) {
>   dir = null
> }
>   } catch { case e: IOException => ; }
> ==
> ==
> WARNING 1: empty handler for exception: java.lang.InterruptedException
> THERE IS NO LOG MESSAGE!!!
>   Line: 172, File: "parquet/org/apache/thrift/server/TNonblockingServer.java"
>   protected void joinSelector() {
> // wait until the selector thread exits
> try {
>   selectThread_.join();
> } catch (InterruptedException e) {
>   // for now, just silently ignore. technically this means we'll have 
> less of
>   // a graceful shutdown as a result.
> }
>   }
> ==
> ==
> WARNING 2: empty handler for exception: java.net.SocketException
> There are log messages..
>   Line: 111, File: 
> "parquet/org/apache/thrift/transport/TNonblockingSocket.java"
>   public void setTimeout(int timeout) {
> try {
>   socketChannel_.socket().setSoTimeout(timeout);
> } catch (SocketException sx) {
>   LOGGER.warn("Could not set socket time

[jira] [Commented] (SPARK-4941) Yarn cluster mode does not upload all needed jars to driver node (Spark 1.2.0)

2015-02-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324941#comment-14324941
 ] 

Sean Owen commented on SPARK-4941:
--

Can you clarify what you expect to be uploaded and what is being uploaded, and 
what the problem is?

> Yarn cluster mode does not upload all needed jars to driver node (Spark 1.2.0)
> --
>
> Key: SPARK-4941
> URL: https://issues.apache.org/jira/browse/SPARK-4941
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Gurpreet Singh
>
> I am specifying additional jars and a config XML file with the --jars and --files 
> options to be uploaded to the driver in the following spark-submit command. 
> However, they are not getting uploaded.
> This results in the job failing. It was working with the Spark 1.0.2 build.
> Spark-Build being used (spark-1.2.0.tgz)
> 
> $SPARK_HOME/bin/spark-submit \
> --class com.ebay.inc.scala.testScalaXML \
> --driver-class-path 
> /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1--2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop/share/hadoop/common/lib/hadoop--0.1--2.jar:/apache/hive/lib/mysql-connector-java-5.0.8-bin.jar:/apache/hadoop/share/hadoop/common/lib/guava-11.0.2.jar
>  \
> --master yarn \
> --deploy-mode cluster \
> --num-executors 3 \
> --driver-memory 1G  \
> --executor-memory 1G \
> /export/home/b_incdata_rw/gurpreetsingh/jar/testscalaxml_2.11-1.0.jar 
> /export/home/b_incdata_rw/gurpreetsingh/sqlFramework.xml next_gen_linking \
> --queue hdmi-spark \
> --jars 
> /export/home/b_incdata_rw/gurpreetsingh/jar/datanucleus-api-jdo-3.2.1.jar,/export/home/b_incdata_rw/gurpreetsingh/jar/datanucleus-core-3.2.2.jar,/export/home/b_incdata_rw/gurpreetsingh/jar/datanucleus-rdbms-3.2.1.jar,/apache/hive/lib/mysql-connector-java-5.0.8-bin.jar,/apache/hadoop/share/hadoop/common/lib/hadoop--0.1--2.jar,/apache/hadoop/share/hadoop/common/lib/hadoop-lzo-0.6.0.jar,/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1--2.jar\
> --files 
> /export/home/b_incdata_rw/gurpreetsingh/spark-1.0.2-bin-2.4.1/conf/hive-site.xml
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 14/12/22 23:00:17 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
> to rm2
> 14/12/22 23:00:17 INFO yarn.Client: Requesting a new application from cluster 
> with 2026 NodeManagers
> 14/12/22 23:00:17 INFO yarn.Client: Verifying our application has not 
> requested more than the maximum memory capability of the cluster (16384 MB 
> per container)
> 14/12/22 23:00:17 INFO yarn.Client: Will allocate AM container, with 1408 MB 
> memory including 384 MB overhead
> 14/12/22 23:00:17 INFO yarn.Client: Setting up container launch context for 
> our AM
> 14/12/22 23:00:17 INFO yarn.Client: Preparing resources for our AM container
> 14/12/22 23:00:18 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 14/12/22 23:00:18 WARN hdfs.BlockReaderLocal: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 14/12/22 23:00:21 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 
> 6623380 for b_incdata_rw on 10.115.201.75:8020
> 14/12/22 23:00:21 INFO yarn.Client: 
> Uploading resource 
> file:/home/b_incdata_rw/gurpreetsingh/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar
>  -> 
> hdfs://-nn.vip.xxx.com:8020/user/b_incdata_rw/.sparkStaging/application_1419242629195_8432/spark-assembly-1.2.0-hadoop2.4.0.jar
> 14/12/22 23:00:24 INFO yarn.Client: Uploading resource 
> file:/export/home/b_incdata_rw/gurpreetsingh/jar/firstsparkcode_2.11-1.0.jar 
> -> 
> hdfs://-nn.vip.xxx.com:8020:8020/user/b_incdata_rw/.sparkStaging/application_1419242629195_8432/firstsparkcode_2.11-1.0.jar
> 14/12/22 23:00:25 INFO yarn.Client: Setting up the launch environment for our 
> AM container



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4898) Replace cloudpickle with Dill

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4898:
-
Issue Type: Improvement  (was: Bug)

> Replace cloudpickle with Dill
> -
>
> Key: SPARK-4898
> URL: https://issues.apache.org/jira/browse/SPARK-4898
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Josh Rosen
>
> We should consider replacing our modified version of {{cloudpickle}} with 
> [Dill|https://github.com/uqfoundation/dill], since it supports both Python 2 
> and 3 and might do a better job of handling certain corner-cases.
> I attempted to do this a few months ago but ran into cases where Dill had 
> issues pickling objects defined in doctests, which broke our test suite: 
> https://github.com/uqfoundation/dill/issues/50.  This issue may have been 
> resolved now; I haven't checked.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4172) Progress API in Python

2015-02-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4172.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3027
[https://github.com/apache/spark/pull/3027]

> Progress API in Python
> --
>
> Key: SPARK-4172
> URL: https://issues.apache.org/jira/browse/SPARK-4172
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.3.0
>
>
> The poll based progress API for Python



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5809) OutOfMemoryError in logDebug in RandomForest.scala

2015-02-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324926#comment-14324926
 ] 

Joseph K. Bradley edited comment on SPARK-5809 at 2/17/15 9:32 PM:
---

I see, that makes sense.  I'd recommend using something like 100 to 1000 
features for trees, but you could experiment.


was (Author: josephkb):
I see, that makes sense.  I'd recommend using something like 100 to 1000 
features for trees, but you could experiment.  I'll close the JIRA.

> OutOfMemoryError in logDebug in RandomForest.scala
> --
>
> Key: SPARK-5809
> URL: https://issues.apache.org/jira/browse/SPARK-5809
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Devesh Parekh
>Assignee: Joseph K. Bradley
>Priority: Minor
>  Labels: easyfix
>
> When training a GBM on sparse vectors produced by HashingTF, I get the 
> following OutOfMemoryError, where RandomForest is building a debug string to 
> log.
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:3326)
> at 
> java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
> at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121
> )
> at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
> at java.lang.StringBuilder.append(StringBuilder.java:136)
> at 
> scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
> at 
> scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:327
> )
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
> at 
> scala.collection.AbstractTraversable.addString(Traversable.scala:105)
> at 
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
> at 
> scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
> at 
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
> at 
> scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
> at 
> org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152)
> at 
> org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152)
> at org.apache.spark.Logging$class.logDebug(Logging.scala:63)
> at 
> org.apache.spark.mllib.tree.RandomForest.logDebug(RandomForest.scala:67)
> at 
> org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:150)
> at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:64)
> at 
> org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
> at 
> org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
>  
> at 
> org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)
> A workaround until this is fixed is to modify log4j.properties in the conf 
> directory to filter out debug logs in RandomForest. For example:
> log4j.logger.org.apache.spark.mllib.tree.RandomForest=WARN
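The same workaround can also be applied programmatically (for example from the 
driver program or the spark-shell) via the log4j 1.x API that Spark bundles; a 
sketch, assuming the logger name above matches your build:

{code}
import org.apache.log4j.{Level, Logger}

// Silence only RandomForest's debug output; everything else keeps its configured level.
Logger.getLogger("org.apache.spark.mllib.tree.RandomForest").setLevel(Level.WARN)
{code}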



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5809) OutOfMemoryError in logDebug in RandomForest.scala

2015-02-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324926#comment-14324926
 ] 

Joseph K. Bradley commented on SPARK-5809:
--

I see, that makes sense.  I'd recommend using something like 100 to 1000 
features for trees, but you could experiment.  I'll close the JIRA.

> OutOfMemoryError in logDebug in RandomForest.scala
> --
>
> Key: SPARK-5809
> URL: https://issues.apache.org/jira/browse/SPARK-5809
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Devesh Parekh
>Assignee: Joseph K. Bradley
>Priority: Minor
>  Labels: easyfix
>
> When training a GBM on sparse vectors produced by HashingTF, I get the 
> following OutOfMemoryError, where RandomForest is building a debug string to 
> log.
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:3326)
> at 
> java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
> at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121
> )
> at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
> at java.lang.StringBuilder.append(StringBuilder.java:136)
> at 
> scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
> at 
> scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:327
> )
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
> at 
> scala.collection.AbstractTraversable.addString(Traversable.scala:105)
> at 
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
> at 
> scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
> at 
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
> at 
> scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
> at 
> org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152)
> at 
> org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152)
> at org.apache.spark.Logging$class.logDebug(Logging.scala:63)
> at 
> org.apache.spark.mllib.tree.RandomForest.logDebug(RandomForest.scala:67)
> at 
> org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:150)
> at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:64)
> at 
> org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
> at 
> org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
>  
> at 
> org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)
> A workaround until this is fixed is to modify log4j.properties in the conf 
> directory to filter out debug logs in RandomForest. For example:
> log4j.logger.org.apache.spark.mllib.tree.RandomForest=WARN



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5868) Python UDFs broken by analysis check in HiveContext

2015-02-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5868.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4657
[https://github.com/apache/spark/pull/4657]

> Python UDFs broken by analysis check in HiveContext
> ---
>
> Key: SPARK-5868
> URL: https://issues.apache.org/jira/browse/SPARK-5868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.3.0
>
>
> Technically they are broken in SQLContext as well, but because of the hacky 
> way we handle Python UDFs there it doesn't get checked. Let's fix both things.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5629) Add spark-ec2 action to return info about an existing cluster

2015-02-17 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324907#comment-14324907
 ] 

Shivaram Venkataraman commented on SPARK-5629:
--

Is there an example output for `describe` that you have in mind? And I am not sure 
it'll be easy to list all the clusters, as spark-ec2 looks up clusters by the 
security group / cluster-id.

> Add spark-ec2 action to return info about an existing cluster
> -
>
> Key: SPARK-5629
> URL: https://issues.apache.org/jira/browse/SPARK-5629
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> You can launch multiple clusters using spark-ec2. At some point, you might 
> just want to get some information about an existing cluster.
> Use cases include:
> * Wanting to check something about your cluster in the EC2 web console.
> * Wanting to feed information about your cluster to another tool (e.g. as 
> described in [SPARK-5627]).
> So, in addition to the [existing 
> actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]:
> * {{launch}}
> * {{destroy}}
> * {{login}}
> * {{stop}}
> * {{start}}
> * {{get-master}}
> * {{reboot-slaves}}
> We add a new action, {{describe}}, which describes an existing cluster if 
> given a cluster name, and all clusters if not.
> Some examples:
> {code}
> # describes all clusters launched by spark-ec2
> spark-ec2 describe
> {code}
> {code}
> # describes cluster-1
> spark-ec2 describe cluster-1
> {code}
> In combination with the proposal in [SPARK-5627]:
> {code}
> # describes cluster-3 in a machine-readable way (e.g. JSON)
> spark-ec2 describe cluster-3 --machine-readable
> {code}
> Parallels in similar tools include:
> * [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju
> * [{{starcluster 
> listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node]
>  from MIT StarCluster



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5025) Write a guide for creating well-formed packages for Spark

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5025:
-
Issue Type: Improvement  (was: Bug)

> Write a guide for creating well-formed packages for Spark
> -
>
> Key: SPARK-5025
> URL: https://issues.apache.org/jira/browse/SPARK-5025
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>
> There are an increasing number of OSS projects providing utilities and 
> extensions to Spark. We should write a guide in the Spark docs that explains 
> how to create, package, and publish a third party Spark library. There are a 
> few issues here such as how to list your dependency on Spark, how to deal 
> with your own third party dependencies, etc. We should also cover how to do 
> this for Python libraries.
> In general, we should make it easy to build extension points against any of 
> Spark's API's (e.g. for new data sources, streaming receivers, ML algos, etc) 
> and self-publish libraries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5874) How to improve the current ML pipeline API?

2015-02-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5874:
-
Description: I created this JIRA to collect feedbacks about the ML pipeline 
API we introduced in Spark 1.2. The target is to graduate this set of APIs in 
1.4 with confidence, which requires valuable input from the community. I'll 
create sub-tasks for each major issue.  (was: I create this JIRA to collect 
feedbacks about the ML pipeline API we introduced in Spark 1.2. The target is 
to graduate this set of APIs in 1.4 with confidence, which requires valuable 
input from the community. I'll create sub-tasks for each major issue.)

> How to improve the current ML pipeline API?
> ---
>
> Key: SPARK-5874
> URL: https://issues.apache.org/jira/browse/SPARK-5874
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> I created this JIRA to collect feedbacks about the ML pipeline API we 
> introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 
> with confidence, which requires valuable input from the community. I'll 
> create sub-tasks for each major issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5874) How to improve the current ML pipeline API?

2015-02-17 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5874:


 Summary: How to improve the current ML pipeline API?
 Key: SPARK-5874
 URL: https://issues.apache.org/jira/browse/SPARK-5874
 Project: Spark
  Issue Type: Brainstorming
  Components: ML
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical


I create this JIRA to collect feedbacks about the ML pipeline API we introduced 
in Spark 1.2. The target is to graduate this set of APIs in 1.4 with 
confidence, which requires valuable input from the community. I'll create 
sub-tasks for each major issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5076) Don't show "Cores" or "Memory Per Node" columns for completed applications

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5076:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> Don't show "Cores" or "Memory Per Node" columns for completed applications
> --
>
> Key: SPARK-5076
> URL: https://issues.apache.org/jira/browse/SPARK-5076
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Josh Rosen
>Priority: Minor
>
> In the Master web UI, I don't think that it makes sense to show "Cores" and 
> "Memory per Node" for completed applications; the current behavior may be 
> confusing to users: 
> https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201412.mbox/%3c2ad05705-f7b6-4cf2-b315-6d5483326...@qq.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5811) Documentation for --packages and --repositories on Spark Shell

2015-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324901#comment-14324901
 ] 

Apache Spark commented on SPARK-5811:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/4662

> Documentation for --packages and --repositories on Spark Shell
> --
>
> Key: SPARK-5811
> URL: https://issues.apache.org/jira/browse/SPARK-5811
> Project: Spark
>  Issue Type: Documentation
>  Components: Deploy, Spark Shell
>Affects Versions: 1.3.0
>Reporter: Burak Yavuz
>Priority: Critical
> Fix For: 1.3.0
>
>
> Documentation for the new support for dependency management using Maven 
> coordinates via --packages and --repositories.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5076) Don't show "Cores" or "Memory Per Node" columns for completed applications

2015-02-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324899#comment-14324899
 ] 

Sean Owen commented on SPARK-5076:
--

Should we roll this into SPARK-5771? The PR resolution is to remove the current 
cores column, and it could remove the memory column too.

> Don't show "Cores" or "Memory Per Node" columns for completed applications
> --
>
> Key: SPARK-5076
> URL: https://issues.apache.org/jira/browse/SPARK-5076
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Josh Rosen
>
> In the Master web UI, I don't think that it makes sense to show "Cores" and 
> "Memory per Node" for completed applications; the current behavior may be 
> confusing to users: 
> https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201412.mbox/%3c2ad05705-f7b6-4cf2-b315-6d5483326...@qq.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5198) Change executorId more unique on mesos fine-grained mode

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5198.
--
Resolution: Not a Problem

OP requested to resolve this as NotAProblem in comments.

> Change executorId more unique on mesos fine-grained mode
> 
>
> Key: SPARK-5198
> URL: https://issues.apache.org/jira/browse/SPARK-5198
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Jongyoul Lee
>Priority: Minor
> Attachments: Screen Shot 2015-01-12 at 11.14.39 AM.png, Screen Shot 
> 2015-01-12 at 11.34.30 AM.png, Screen Shot 2015-01-12 at 11.34.41 AM.png
>
>
> In fine-grained mode, SchedulerBackend sets the executor name to the same value 
> as the slave id for any task id. That makes it hard to track a specific job, 
> because different jobs end up logging into the same log file. The same value is 
> used when launching a job in coarse-grained mode.
> !Screen Shot 2015-01-12 at 11.14.39 AM.png!
> !Screen Shot 2015-01-12 at 11.34.30 AM.png!
> !Screen Shot 2015-01-12 at 11.34.41 AM.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5198) Change executorId more unique on mesos fine-grained mode

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5198:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> Change executorId more unique on mesos fine-grained mode
> 
>
> Key: SPARK-5198
> URL: https://issues.apache.org/jira/browse/SPARK-5198
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Jongyoul Lee
>Priority: Minor
> Attachments: Screen Shot 2015-01-12 at 11.14.39 AM.png, Screen Shot 
> 2015-01-12 at 11.34.30 AM.png, Screen Shot 2015-01-12 at 11.34.41 AM.png
>
>
> In fine-grained mode, SchedulerBackend sets the executor name to the same value 
> as the slave id for any task id. That makes it hard to track a specific job, 
> because different jobs end up logging into the same log file. The same value is 
> used when launching a job in coarse-grained mode.
> !Screen Shot 2015-01-12 at 11.14.39 AM.png!
> !Screen Shot 2015-01-12 at 11.34.30 AM.png!
> !Screen Shot 2015-01-12 at 11.34.41 AM.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5230) Print usage for spark-submit and spark-class in Windows

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5230:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> Print usage for spark-submit and spark-class in Windows
> ---
>
> Key: SPARK-5230
> URL: https://issues.apache.org/jira/browse/SPARK-5230
> Project: Spark
>  Issue Type: Improvement
>  Components: Windows
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Priority: Minor
>
> We currently only print the usage in `bin/spark-shell2.cmd`. We should do it 
> for `bin/spark-submit2.cmd` and `bin/spark-class2.cmd` too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5629) Add spark-ec2 action to return info about an existing cluster

2015-02-17 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324888#comment-14324888
 ] 

Nicholas Chammas commented on SPARK-5629:
-

cc [~joshrosen] / [~shivaram]

I see that we already have a {{get-master}} action which will probably serve 
most use cases where spark-ec2 is being used as part of some automated pipeline 
(e.g. spark-perf testing). Typically, you just want the master address so you 
can ssh in and do stuff.

Still, I'm looking for your initial reaction to this proposal.

> Add spark-ec2 action to return info about an existing cluster
> -
>
> Key: SPARK-5629
> URL: https://issues.apache.org/jira/browse/SPARK-5629
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> You can launch multiple clusters using spark-ec2. At some point, you might 
> just want to get some information about an existing cluster.
> Use cases include:
> * Wanting to check something about your cluster in the EC2 web console.
> * Wanting to feed information about your cluster to another tool (e.g. as 
> described in [SPARK-5627]).
> So, in addition to the [existing 
> actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]:
> * {{launch}}
> * {{destroy}}
> * {{login}}
> * {{stop}}
> * {{start}}
> * {{get-master}}
> * {{reboot-slaves}}
> We add a new action, {{describe}}, which describes an existing cluster if 
> given a cluster name, and all clusters if not.
> Some examples:
> {code}
> # describes all clusters launched by spark-ec2
> spark-ec2 describe
> {code}
> {code}
> # describes cluster-1
> spark-ec2 describe cluster-1
> {code}
> In combination with the proposal in [SPARK-5627]:
> {code}
> # describes cluster-3 in a machine-readable way (e.g. JSON)
> spark-ec2 describe cluster-3 --machine-readable
> {code}
> Parallels in similar tools include:
> * [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju
> * [{{starcluster 
> listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node]
>  from MIT StarCluster



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5325) Simplifying Hive shim implementation

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5325:
-
Issue Type: Improvement  (was: Bug)

> Simplifying Hive shim implementation
> 
>
> Key: SPARK-5325
> URL: https://issues.apache.org/jira/browse/SPARK-5325
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Cheng Lian
>
> The Hive shim layer introduced in Spark 1.2.0 brings a maintenance burden. On 
> the other hand, many of those methods in the shim layer can be re-implemented 
> in a backwards-compatible way, or replaced with simple reflection tricks.
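As an illustration of the kind of "simple reflection trick" the description 
mentions (a generic sketch, not code from the shim layer itself):

{code}
// Look up and invoke a method by name, so the same source compiles and runs
// against Hive versions whose signatures differ. Purely illustrative.
def invokeByName(target: AnyRef, method: String, args: AnyRef*): AnyRef = {
  val m = target.getClass.getMethods.find(_.getName == method)
    .getOrElse(throw new NoSuchMethodException(method))
  m.invoke(target, args: _*)
}
{code}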



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5533) Replace explicit dependency on org.codehaus.jackson

2015-02-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324875#comment-14324875
 ] 

Sean Owen commented on SPARK-5533:
--

Is this resolvable as NotAProblem? Spark itself does not use 
org.codehaus.jackson

> Replace explicit dependency on org.codehaus.jackson
> ---
>
> Key: SPARK-5533
> URL: https://issues.apache.org/jira/browse/SPARK-5533
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>
> We should use the newer com.fasterxml.jackson, which we currently also 
> include and use as a dependency from Tachyon. Instead of having both versions 
> magically work, we should clean up the dependency structure to make sure we 
> only use one version of Jackson.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5330) Core | Scala 2.11 | Transitive dependency on com.fasterxml.jackson.core :jackson-core:2.3.1 causes compatibility issues

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5330:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> Core | Scala 2.11 | Transitive dependency on com.fasterxml.jackson.core 
> :jackson-core:2.3.1 causes compatibility issues
> ---
>
> Key: SPARK-5330
> URL: https://issues.apache.org/jira/browse/SPARK-5330
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Aniket Bhatnagar
>Priority: Minor
>
> Spark transitively depends on com.fasterxml.jackson.core:jackson-core:2.3.1. 
> Users of jackson-module-scala had to depend on the same version to avoid any 
> class compatibility issues. However, since Scala 2.11, jackson-module-scala 
> is no longer published for version 2.3.1. Since version 2.3.1 is quite old, 
> perhaps we should investigate upgrading to the latest jackson-core. 
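For users of jackson-module-scala hitting this today, one workaround is to force a single Jackson version across the classpath from the application build. A minimal build.sbt sketch, assuming sbt and an example version of 2.4.4 (the exact version to standardize on is still the open question here):

{code}
// build.sbt sketch: keep jackson-core and jackson-module-scala on one version
// so Spark's transitive jackson-core cannot mix binary-incompatible jars.
libraryDependencies += "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.4.4"

dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.4.4"
{code}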



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5330) Core | Scala 2.11 | Transitive dependency on com.fasterxml.jackson.core :jackson-core:2.3.1 causes compatibility issues

2015-02-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324871#comment-14324871
 ] 

Sean Owen commented on SPARK-5330:
--

What version do you suggest? Can you examine mvn dependency:tree to see what 
version is probably safest to update to? Can you open a PR to update?

> Core | Scala 2.11 | Transitive dependency on com.fasterxml.jackson.core 
> :jackson-core:2.3.1 causes compatibility issues
> ---
>
> Key: SPARK-5330
> URL: https://issues.apache.org/jira/browse/SPARK-5330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Aniket Bhatnagar
>
> Spark transitively depends on com.fasterxml.jackson.core:jackson-core:2.3.1. 
> Users of jackson-module-scala had to depend on the same version to avoid any 
> class compatibility issues. However, since Scala 2.11, jackson-module-scala 
> is no longer published for version 2.3.1. Since version 2.3.1 is quite old, 
> perhaps we should investigate upgrading to the latest jackson-core. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5369) remove allocatedHostToContainersMap.synchronized in YarnAllocator

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5369.
--
Resolution: Duplicate

> remove allocatedHostToContainersMap.synchronized in YarnAllocator
> -
>
> Key: SPARK-5369
> URL: https://issues.apache.org/jira/browse/SPARK-5369
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Lianhui Wang
>
> As SPARK-1714 mentioned, because YarnAllocator.allocateResources is a 
> synchronized method, we can remove the 
> allocatedHostToContainersMap.synchronized block in 
> YarnAllocator.allocateResources.
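A purely illustrative sketch of the reasoning (not the actual YarnAllocator code): when the only entry point that touches a shared map is itself a synchronized method, an additional lock on the map guards nothing extra.

{code}
import scala.collection.mutable

// Illustrative only: `allocate` is the sole method that touches the map and it
// is synchronized on `this`, so a nested `hostToContainers.synchronized { ... }`
// block inside it would be redundant.
class Allocator {
  private val hostToContainers = new mutable.HashMap[String, Int]()

  def allocate(host: String): Unit = synchronized {
    val count = hostToContainers.getOrElse(host, 0)
    hostToContainers(host) = count + 1
  }
}
{code}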



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5873) Can't see partially analyzed plans

2015-02-17 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-5873:
---

 Summary: Can't see partially analyzed plans
 Key: SPARK-5873
 URL: https://issues.apache.org/jira/browse/SPARK-5873
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker


Our analysis checks are great for users who make mistakes, but they make it 
impossible to see what is going wrong when there is a bug in the analyzer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5372) Change the default storage level of window operators

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5372:
-
  Priority: Minor  (was: Major)
Issue Type: Task  (was: Bug)

> Change the default storage level of window operators
> 
>
> Key: SPARK-5372
> URL: https://issues.apache.org/jira/browse/SPARK-5372
> Project: Spark
>  Issue Type: Task
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> The current storage level of window operators is MEMORY_ONLY_SER. If the 
> memory is not enough to hold all the window data, cached RDDs will be 
> discarded, which leads to unexpected behavior. 
> Besides, since the default storage level of input data is 
> MEMORY_AND_DISK_SER_2, it is better to align with that and change the 
> storage level of window operators to MEMORY_AND_DISK_SER. 
> This change has no effect when memory is sufficient, so I'd propose changing 
> the default storage level to MEMORY_AND_DISK_SER.
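Until the default changes, a caller can already opt into the proposed level explicitly. A minimal sketch against the streaming API (the socket source, durations, and app name are placeholders):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowStorageLevelSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("window-demo"), Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Override the window operator's default MEMORY_ONLY_SER with
    // MEMORY_AND_DISK_SER so window data spills to disk instead of being dropped.
    val windowed = lines.window(Seconds(30), Seconds(10))
    windowed.persist(StorageLevel.MEMORY_AND_DISK_SER)

    windowed.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}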



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5519) Add user guide for FP-Growth

2015-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324865#comment-14324865
 ] 

Apache Spark commented on SPARK-5519:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/4661

> Add user guide for FP-Growth
> 
>
> Key: SPARK-5519
> URL: https://issues.apache.org/jira/browse/SPARK-5519
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We need to add a section for FP-Growth in the user guide after the FP-Growth 
> PR is merged.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5386) Reduce fails with vectors of big length

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5386.
--
  Resolution: Not a Problem
   Fix Version/s: (was: 1.3.0)
Target Version/s:   (was: 1.3.0)

> Reduce fails with vectors of big length
> ---
>
> Key: SPARK-5386
> URL: https://issues.apache.org/jira/browse/SPARK-5386
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
> Environment: Overall:
> 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
> Spark:
> ./spark-shell --executor-memory 8G --driver-memory 8G
> spark.driver.maxResultSize 0
> "java.io.tmpdir" and "spark.local.dir" set to a disk with a lot of free space
>Reporter: Alexander Ulanov
>
> Code:
> import org.apache.spark.mllib.rdd.RDDFunctions._
> import breeze.linalg._
> import org.apache.log4j._
> Logger.getRootLogger.setLevel(Level.OFF)
> val n = 6000
> val p = 12
> val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
> vv.count()
> vv.reduce(_ + _)
> When executed in the shell, it crashes after some period of time. One of the 
> nodes contains the following in stdout:
> Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
> os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot 
> allocate memory' (errno=12)
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
> committing reserved memory.
> # An error report file with more information is saved as:
> # /datac/spark/app-20150123091936-/89/hs_err_pid2247.log
> During the execution there is a message: Job aborted due to stage failure: 
> Exception while getting task result: java.io.IOException: Connection from 
> server-12.net/10.10.10.10:54701 closed
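One mitigation worth trying with vectors of this size is to combine partial sums on the executors rather than shipping every partition's vector to a single reduce. A sketch, assuming the treeReduce helper exposed through the mllib RDDFunctions import already used above:

{code}
// Sketch: reduce in a tree of depth 2 so far fewer large vectors are
// held and merged in any single place at once.
val sum = vv.treeReduce(_ + _, depth = 2)
{code}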



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5394) kafka link in streaming docs goes to nowhere

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5394.
--
Resolution: Duplicate

> kafka link in streaming docs goes to nowhere
> 
>
> Key: SPARK-5394
> URL: https://issues.apache.org/jira/browse/SPARK-5394
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Streaming
>Affects Versions: 1.2.0
>Reporter: Jon Haddad
>
> The link to the kafka example on this page 
> https://spark.apache.org/docs/1.2.0/streaming-kafka-integration.html is 
> broken.
> "See the API docs and the example."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5455) Add MultipleTransformer abstract class

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5455:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> Add MultipleTransformer abstract class
> --
>
> Key: SPARK-5455
> URL: https://issues.apache.org/jira/browse/SPARK-5455
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.2.0
>Reporter: Peter Rudenko
>Priority: Minor
>
> There's an existing UnaryTransformer abstract class. We need to make a public 
> MultipleTransformer class that would accept multiple columns as input and 
> produce a single output column (e.g. from [col1, col2, col3, ...] => 
> Vector(col1, col2, col3, ...) or mean([col1, col2, col3, ...])).
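A conceptual sketch of the shape such an abstraction could take (plain Scala only; the real spark.ml Transformer/Params plumbing is deliberately omitted):

{code}
// Conceptual sketch: map several input columns to one output value.
// A real spark.ml version would extend Transformer and declare Params.
abstract class MultipleTransformer[IN, OUT](val inputCols: Seq[String],
                                            val outputCol: String) {
  /** Combine the values of all input columns into a single output value. */
  protected def createTransformFunc: Seq[IN] => OUT
}

// Example: average several numeric columns into one output column.
class MeanTransformer(inputCols: Seq[String], outputCol: String)
  extends MultipleTransformer[Double, Double](inputCols, outputCol) {
  protected def createTransformFunc: Seq[Double] => Double =
    values => values.sum / values.length
}
{code}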



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5809) OutOfMemoryError in logDebug in RandomForest.scala

2015-02-17 Thread Devesh Parekh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324862#comment-14324862
 ] 

Devesh Parekh commented on SPARK-5809:
--

This was a naive run of GBM on TFIDF vectors produced by HashingTF 
(https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.mllib.feature.HashingTF),
 which creates 2^20 features (more than a million). What is the maximum number 
of features that GradientBoostedTrees will work for? I'll do a dimensionality 
reduction before trying again.
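For reference, the feature-space size can also be lowered at extraction time by passing a smaller numFeatures to HashingTF. A sketch (2^15 buckets and the tokenizedDocuments RDD are just placeholders):

{code}
import org.apache.spark.mllib.feature.HashingTF

// Sketch: hash into 2^15 buckets instead of the default 2^20, trading a few
// extra hash collisions for much smaller feature vectors before training.
val hashingTF = new HashingTF(numFeatures = 1 << 15)
// val features = hashingTF.transform(tokenizedDocuments) // tokenizedDocuments: RDD[Seq[String]]
{code}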

> OutOfMemoryError in logDebug in RandomForest.scala
> --
>
> Key: SPARK-5809
> URL: https://issues.apache.org/jira/browse/SPARK-5809
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Devesh Parekh
>Assignee: Joseph K. Bradley
>Priority: Minor
>  Labels: easyfix
>
> When training a GBM on sparse vectors produced by HashingTF, I get the 
> following OutOfMemoryError, where RandomForest is building a debug string to 
> log.
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:3326)
> at 
> java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
> at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121
> )
> at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
> at java.lang.StringBuilder.append(StringBuilder.java:136)
> at 
> scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
> at 
> scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:327
> )
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
> at 
> scala.collection.AbstractTraversable.addString(Traversable.scala:105)
> at 
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
> at 
> scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
> at 
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
> at 
> scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
> at 
> org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152)
> at 
> org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152)
> at org.apache.spark.Logging$class.logDebug(Logging.scala:63)
> at 
> org.apache.spark.mllib.tree.RandomForest.logDebug(RandomForest.scala:67)
> at 
> org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:150)
> at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:64)
> at 
> org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
> at 
> org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
>  
> at 
> org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)
> A workaround until this is fixed is to modify log4j.properties in the conf 
> directory to filter out debug logs in RandomForest. For example:
> log4j.logger.org.apache.spark.mllib.tree.RandomForest=WARN



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5487) Dockerfile to build spark's custom akka.

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5487:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> Dockerfile to build spark's custom akka.
> 
>
> Key: SPARK-5487
> URL: https://issues.apache.org/jira/browse/SPARK-5487
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: jay vyas
>Priority: Minor
>
> Building Spark's custom shaded Akka version is tricky. The code is in 
> https://github.com/pwendell/akka/ (branch = 2.2.3-shaded-proto); however, 
> when attempting to build, I receive some strange errors.
> I've attempted to fork off of a Dockerfile for {{SBT 0.12.4}}, which I'll 
> attach in a snippet just as an example of what we might want to facilitate 
> building the Spark-specific Akka until SPARK-5293 is completed.
> {noformat}
> [info] Compiling 6 Scala sources and 1 Java source to 
> /tmp/akka/akka-multi-node-testkit/target/classes...
> [warn] Class com.google.protobuf.MessageLite not found - continuing with a 
> stub.
> [error] error while loading ProtobufDecoder, class file 
> '/root/.ivy2/cache/io.netty/netty/bundles/netty-3.6.6.Final.jar(org/jboss/netty/handler/codec/protobuf/ProtobufDecoder.class)'
>  is broken
> [error] (class java.lang.NullPointerException/null)
> [error] 
> /tmp/akka/akka-multi-node-testkit/src/main/scala/akka/remote/testconductor/RemoteConnection.scala:24:
>  org.jboss.netty.handler.codec.protobuf.ProtobufDecoder does not have a 
> constructor
> [error] val proto = List(new ProtobufEncoder, new 
> ProtobufDecoder(TestConductorProtocol.Wrapper.getDefaultInstance))
> [error]   ^
> [error] 
> /tmp/akka/akka-multi-node-testkit/src/main/scala/akka/remote/testkit/MultiNodeSpec.scala:267:
>  value await is not a member of 
> scala.concurrent.Future[Iterable[akka.remote.testconductor.RoleName]]
> [error]  Note: implicit method awaitHelper is not applicable here because it 
> comes after the application point and it lacks an explicit result type
> [error]   testConductor.getNodes.await.filterNot(_ == myself).isEmpty
> [error]  ^
> [error] 
> /tmp/akka/akka-multi-node-testkit/src/main/scala/akka/remote/testkit/MultiNodeSpec.scala:354:
>  value await is not a member of scala.concurrent.Future[akka.actor.Address]
> [error]  Note: implicit method awaitHelper is not applicable here because it 
> comes after the application point and it lacks an explicit result type
> [error]   def node(role: RoleName): ActorPath = 
> RootActorPath(testConductor.getAddressFor(role).await)
> [error]   
>   ^
> [warn] one warning found
> [error] four errors found
> [info] Updating {file:/tmp/akka/}akka-docs...
> [info] Done updating.
> [info] Updating {file:/tmp/akka/}akka-contrib...
> [info] Done updating.
> [info] Updating {file:/tmp/akka/}akka-sample-osgi-dining-hakkers-core...
> [info] Done updating.
> [info] Compiling 17 Scala sources to /tmp/akka/akka-cluster/target/classes...
> [error] 
> /tmp/akka/akka-cluster/src/main/scala/akka/cluster/protobuf/ClusterMessageSerializer.scala:59:
>  type mismatch;
> [error]  found   : akka.cluster.protobuf.msg.GossipEnvelope
> [error]  required: com.google.protobuf_spark.MessageLite
> [error]   case m: GossipEnvelope ⇒ compress(gossipEnvelopeToProto(m))
> [error]  ^
> [error] 
> /tmp/akka/akka-cluster/src/main/scala/akka/cluster/protobuf/ClusterMessageSerializer.scala:61:
>  type mismatch;
> [error]  found   : akka.cluster.protobuf.msg.MetricsGossipEnvelope
> [error]  required: com.google.protobuf_spark.MessageLite
> [error]   case m: MetricsGossipEnvelope ⇒ 
> compress(metricsGossipEnvelopeToProto(m))
> [error]   
>  ^
> [error] 
> /tmp/akka/akka-cluster/src/main/scala/akka/cluster/protobuf/ClusterMessageSerializer.scala:63:
>  type mismatch;
> [error]  found   : akka.cluster.protobuf.msg.Welcome
> [error]  required: com.google.protobuf_spark.MessageLite
> [error]   case InternalClusterAction.Welcome(from, gossip) ⇒ 
> compress(msg.Welcome(uniqueAddressToProto(from), gossipToProto(gossip)))
> [error]   
>^
> [error] 
> /tmp/akka/akka-cluster/src/main/scala/akka/cluster/protobuf/ClusterMessageSerializer.scala:257:
>  type mismatch;
> [error]  found   : com.google.protobuf_spark.ByteString
> [error]  required: com.google.protobuf.ByteString
> [error]   
> msg.NodeMetrics.Number(msg.NodeMetrics.NumberType.Serialized, None, None, 
> Some(Byt

[jira] [Commented] (SPARK-4454) Race condition in DAGScheduler

2015-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324858#comment-14324858
 ] 

Apache Spark commented on SPARK-4454:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4660

> Race condition in DAGScheduler
> --
>
> Key: SPARK-4454
> URL: https://issues.apache.org/jira/browse/SPARK-4454
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.1.0
>Reporter: Rafal Kwasny
>Assignee: Josh Rosen
>Priority: Critical
>
> There seems to be a race condition in DAGScheduler that manifests in jobs 
> with high concurrency:
> {noformat}
>  Exception in thread "main" java.util.NoSuchElementException: key not found: 
> 35
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:58)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
> at 
> org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
> at 
> org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1275)
> at 
> 

[jira] [Updated] (SPARK-5541) Allow running Maven or SBT in run-tests

2015-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5541:
-
Issue Type: Improvement  (was: Bug)

> Allow running Maven or SBT in run-tests
> ---
>
> Key: SPARK-5541
> URL: https://issues.apache.org/jira/browse/SPARK-5541
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Patrick Wendell
>Assignee: Nicholas Chammas
>
> It would be nice if we had a hook for the Spark test scripts to run with 
> Maven in addition to running with SBT. Right now it is difficult for us to 
> test pull requests in Maven, and we get master build breaks because of it. A 
> simple first step is to modify run-tests to allow building with Maven. Then 
> we can add a second PRB that invokes this Maven build. I would just add an 
> env var called SPARK_BUILD_TOOL that can be set to "sbt" or "mvn", and make 
> sure the associated logic works in either case. If we don't want to have the 
> fancy "SQL"-only stuff in Maven, that's fine too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-4544) Spark JVM Metrics doesn't have context.

2015-02-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4544:
--
Comment: was deleted

(was: User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4660)

> Spark JVM Metrics doesn't have context.
> ---
>
> Key: SPARK-4544
> URL: https://issues.apache.org/jira/browse/SPARK-4544
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Sreepathi Prasanna
>
> If we enable JVM metrics for executor, master, worker, and driver instances, 
> we don't have context about where they are coming from.
> This can be an issue if we are collecting all the metrics from different 
> instances and storing them into a common datastore. 
> This mainly concerns running Spark on YARN, but I believe Spark standalone 
> also has this problem.
> It would be good if we attached some context to the JVM metrics. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5629) Add spark-ec2 action to return info about an existing cluster

2015-02-17 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-5629:

Description: 
You can launch multiple clusters using spark-ec2. At some point, you might just 
want to get some information about an existing cluster.

Use cases include:
* Wanting to check something about your cluster in the EC2 web console.
* Wanting to feed information about your cluster to another tool (e.g. as 
described in [SPARK-5627]).

So, in addition to the [existing 
actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]:
* {{launch}}
* {{destroy}}
* {{login}}
* {{stop}}
* {{start}}
* {{get-master}}
* {{reboot-slaves}}

We add a new action, {{describe}}, which describes an existing cluster if given 
a cluster name, and all clusters if not.

Some examples:
{code}
# describes all clusters launched by spark-ec2
spark-ec2 describe
{code}

{code}
# describes cluster-1
spark-ec2 describe cluster-1
{code}

In combination with the proposal in [SPARK-5627]:
{code}
# describes cluster-3 in a machine-readable way (e.g. JSON)
spark-ec2 describe cluster-3 --machine-readable
{code}


Parallels in similar tools include:
* [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju
* [{{starcluster 
listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node]
 from MIT StarCluster

  was:
You can launch multiple clusters using spark-ec2. At some point, you might just 
want to get some information about an existing cluster.

Use cases include:
* Wanting to check something about your cluster in the EC2 web console.
* Wanting to feed information about your cluster to another tool (e.g. as 
described in [SPARK-5627]).
  For example:

So, in addition to the [existing 
actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]:
* {{launch}}
* {{destroy}}
* {{login}}
* {{stop}}
* {{start}}
* {{get-master}}
* {{reboot-slaves}}

We add a new action, {{describe}}, which describes an existing cluster if given 
a cluster name, and all clusters if not.

Parallels in similar tools include:
* [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju
* [{{starcluster 
listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node]
 from MIT StarCluster


> Add spark-ec2 action to return info about an existing cluster
> -
>
> Key: SPARK-5629
> URL: https://issues.apache.org/jira/browse/SPARK-5629
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> You can launch multiple clusters using spark-ec2. At some point, you might 
> just want to get some information about an existing cluster.
> Use cases include:
> * Wanting to check something about your cluster in the EC2 web console.
> * Wanting to feed information about your cluster to another tool (e.g. as 
> described in [SPARK-5627]).
> So, in addition to the [existing 
> actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]:
> * {{launch}}
> * {{destroy}}
> * {{login}}
> * {{stop}}
> * {{start}}
> * {{get-master}}
> * {{reboot-slaves}}
> We add a new action, {{describe}}, which describes an existing cluster if 
> given a cluster name, and all clusters if not.
> Some examples:
> {code}
> # describes all clusters launched by spark-ec2
> spark-ec2 describe
> {code}
> {code}
> # describes cluster-1
> spark-ec2 describe cluster-1
> {code}
> In combination with the proposal in [SPARK-5627]:
> {code}
> # describes cluster-3 in a machine-readable way (e.g. JSON)
> spark-ec2 describe cluster-3 --machine-readable
> {code}
> Parallels in similar tools include:
> * [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju
> * [{{starcluster 
> listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node]
>  from MIT StarCluster



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


