[jira] [Commented] (SPARK-12350) VectorAssembler#transform() initially throws an exception

2015-12-16 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060500#comment-15060500
 ] 

Jakob Odersky commented on SPARK-12350:
---

You're right; somewhere in the huge stack trace output I also see the dataframe
displayed as a table.

> VectorAssembler#transform() initially throws an exception
> -
>
> Key: SPARK-12350
> URL: https://issues.apache.org/jira/browse/SPARK-12350
> Project: Spark
>  Issue Type: Bug
>  Components: ML
> Environment: sparkShell command from sbt
>Reporter: Jakob Odersky
>
> Calling VectorAssembler.transform() initially throws an exception, subsequent 
> calls work.
> h3. Steps to reproduce
> In spark-shell,
> 1. Create a dummy dataframe and define an assembler
> {code}
> import org.apache.spark.ml.feature.VectorAssembler
> val df = sc.parallelize(List((1,2), (3,4))).toDF
> val assembler = new VectorAssembler().setInputCols(Array("_1", 
> "_2")).setOutputCol("features")
> {code}
> 2. Run
> {code}
> assembler.transform(df).show
> {code}
> Initially the following exception is thrown:
> {code}
> 15/12/15 16:20:19 ERROR TransportRequestHandler: Error opening stream 
> /classes/org/apache/spark/sql/catalyst/expressions/Object.class for request 
> from /9.72.139.102:60610
> java.lang.IllegalArgumentException: requirement failed: File not found: 
> /classes/org/apache/spark/sql/catalyst/expressions/Object.class
>   at scala.Predef$.require(Predef.scala:233)
>   at 
> org.apache.spark.rpc.netty.NettyStreamManager.openStream(NettyStreamManager.scala:60)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Subsequent calls work:
> {code}
> +---+---+-+
> | _1| _2| features|
> +---+---+-+
> |  1|  2|[1.0,2.0]|
> |  3|  4|[3.0,4.0]|
> +---+---+-+
> {code}
> It seems as though there is some internal state that is not initialized.
> [~iyounus] originally found this issue.






[jira] [Created] (SPARK-12350) VectorAssembler#transform() initially throws an exception

2015-12-15 Thread Jakob Odersky (JIRA)
Jakob Odersky created SPARK-12350:
-

 Summary: VectorAssembler#transform() initially throws an exception
 Key: SPARK-12350
 URL: https://issues.apache.org/jira/browse/SPARK-12350
 Project: Spark
  Issue Type: Bug
  Components: ML
 Environment: sparkShell command from sbt
Reporter: Jakob Odersky


Calling VectorAssembler.transform() initially throws an exception, subsequent 
calls work.

h3. Steps to reproduce
In spark-shell,
1. Create a dummy dataframe and define an assembler
{code}
import org.apache.spark.ml.feature.VectorAssembler
val df = sc.parallelize(List((1,2), (3,4))).toDF
val assembler = new VectorAssembler().setInputCols(Array("_1", 
"_2")).setOutputCol("features")
{code}

2. Run
{code}
assembler.transform(df).show
{code}
Initially the following exception is thrown:
{code}
15/12/15 16:20:19 ERROR TransportRequestHandler: Error opening stream 
/classes/org/apache/spark/sql/catalyst/expressions/Object.class for request 
from /9.72.139.102:60610
java.lang.IllegalArgumentException: requirement failed: File not found: 
/classes/org/apache/spark/sql/catalyst/expressions/Object.class
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.rpc.netty.NettyStreamManager.openStream(NettyStreamManager.scala:60)
at 
org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
{code}
Subsequent calls work:
{code}
+---+---+-+
| _1| _2| features|
+---+---+-+
|  1|  2|[1.0,2.0]|
|  3|  4|[3.0,4.0]|
+---+---+-+
{code}

It seems as though there is some internal state that is not initialized.

[~iyounus] originally found this issue.






Re: what are the cons/drawbacks of a Spark DataFrames

2015-12-15 Thread Jakob Odersky
With DataFrames you lose type safety. Depending on the language you are
using, this can also be considered a drawback.
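
To illustrate (just a rough sketch against the 1.5-era shell API, with made-up
data and column names): a misspelled field is caught at compile time with RDDs,
but only at runtime with DataFrames.

// in spark-shell, where `sc` and `sqlContext` are already defined
case class Person(name: String, age: Int)
import sqlContext.implicits._

val rdd = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))
val df  = rdd.toDF()

rdd.map(_.age + 1)      // checked by the compiler: `age` must exist and be numeric
df.select($"aeg" + 1)   // compiles, but fails only at runtime: no column named "aeg"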

On 15 December 2015 at 15:08, Jakob Odersky  wrote:

> By using DataFrames you will not need to specify RDD operations explicitly;
> instead, the operations are built and optimized using the information
> available in the DataFrame's schema.
> The only drawback I can think of is some loss of generality: given a
> dataframe containing types A, you will be able to include types B even if B
> is a sub-type of A. However, in real use cases I have never run into this
> problem.
>
> I once had a related question on RDDs and DataFrames, here is the answer I
> got from Michael Armbrust:
>
> Here is how I view the relationship between the various components of
>> Spark:
>>
>>  - *RDDs - *a low level API for expressing DAGs that will be executed in
>> parallel by Spark workers
>>  - *Catalyst -* an internal library for expressing trees that we use to
>> build relational algebra and expression evaluation.  There's also an
>> optimizer and query planner that turns these logical concepts into RDD
>> actions.
>>  - *Tungsten -* an internal optimized execution engine that can compile
>> catalyst expressions into efficient java bytecode that operates directly on
>> serialized binary data.  It also has nice low level data structures /
>> algorithms like hash tables and sorting that operate directly on serialized
>> data.  These are used by the physical nodes that are produced by the
>> query planner (and run inside of RDD operations on workers).
>>  - *DataFrames - *a user facing API that is similar to SQL/LINQ for
>> constructing dataflows that are backed by catalyst logical plans
>>  - *Datasets* - a user facing API that is similar to the RDD API for
>> constructing dataflows that are backed by catalyst logical plans
>>
>> So everything is still operating on RDDs but I anticipate most users will
>> eventually migrate to the higher level APIs for convenience and automatic
>> optimization
>>
>
> Hope that also helps you get an idea of the different concepts and their
> potential advantages/drawbacks.
>


Re: what are the cons/drawbacks of a Spark DataFrames

2015-12-15 Thread Jakob Odersky
By using DataFrames you will not need to specify RDD operations explicitly;
instead, the operations are built and optimized using the information
available in the DataFrame's schema.
The only drawback I can think of is some loss of generality: given a
dataframe containing types A, you will be able to include types B even if B
is a sub-type of A. However, in real use cases I have never run into this
problem.

I once had a related question on RDDs and DataFrames, here is the answer I
got from Michael Armbrust:

Here is how I view the relationship between the various components of Spark:
>
>  - *RDDs - *a low level API for expressing DAGs that will be executed in
> parallel by Spark workers
>  - *Catalyst -* an internal library for expressing trees that we use to
> build relational algebra and expression evaluation.  There's also an
> optimizer and query planner that turns these logical concepts into RDD
> actions.
>  - *Tungsten -* an internal optimized execution engine that can compile
> catalyst expressions into efficient java bytecode that operates directly on
> serialized binary data.  It also has nice low level data structures /
> algorithms like hash tables and sorting that operate directly on serialized
> data.  These are used by the physical nodes that are produced by the
> query planner (and run inside of RDD operations on workers).
>  - *DataFrames - *a user facing API that is similar to SQL/LINQ for
> constructing dataflows that are backed by catalyst logical plans
>  - *Datasets* - a user facing API that is similar to the RDD API for
> constructing dataflows that are backed by catalyst logical plans
>
> So everything is still operating on RDDs but I anticipate most users will
> eventually migrate to the higher level APIs for convenience and automatic
> optimization
>

Hope that also helps you get an idea of the different concepts and their
potential advantages/drawbacks.
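
As a small illustration of that pipeline (a sketch against the 1.5-era API; the
printed plans are only indicative and not reproduced here):

// in spark-shell
import sqlContext.implicits._
val df = sqlContext.range(0, 1000).toDF("id")
df.filter($"id" > 10).select($"id" * 2).explain(true)
// prints the parsed, analyzed and optimized (Catalyst) logical plans, followed
// by the physical plan that ultimately runs as RDD operations on the workers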


Re: ideal number of executors per machine

2015-12-15 Thread Jakob Odersky
Hi Veljko,
I would assume keeping the number of executors per machine to a minimum is
best for performance (as long as you consider memory requirements as well).
Each executor is a process that can run tasks in multiple threads. At the
kernel/hardware level, thread switches are much cheaper than process
switches, and therefore a single executor with many threads gives better
overall performance than multiple executors with fewer threads each.
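
For example (only a sketch; the host name and the memory/core numbers below
are placeholders for a 64G/32-core box, leaving headroom for the OS and
keeping the heap well below 64G to limit GC pauses):

spark-submit \
  --master spark://master-host:7077 \
  --executor-memory 48g \
  --total-executor-cores 64 \
  my-app.jar

On a standalone cluster this typically results in one large, multi-threaded
executor per worker for the application rather than many small ones.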

--Jakob

On 15 December 2015 at 13:07, Veljko Skarich 
wrote:

> Hi,
>
> I'm looking for suggestions on the ideal number of executors per machine.
> I run my jobs on 64G 32 core machines, and at the moment I have one
> executor running per machine, on the spark standalone cluster.
>
>  I could not find many guidelines for figuring out the ideal number of
> executors; the Spark official documentation merely recommends not having
> more than 64G per executor to avoid GC issues. Anyone have and advice on
> this?
>
> thank you.
>


Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-14 Thread Jakob Odersky
sorry typo, I meant *without* the addJar

On 14 December 2015 at 11:13, Jakob Odersky  wrote:

> > Sorry,I'm late.I try again and again ,now I use spark 1.4.0 ,hadoop
> 2.4.1.but I also find something strange like this :
>
> >
> http://apache-spark-user-list.1001560.n3.nabble.com/worker-java-lang-ClassNotFoundException-ttt-test-anonfun-1-td25696.html
> > (if i use "textFile",It can't run.)
>
> In the link you sent, there is still an
> `addJar(spark-assembly-hadoop-xx)`, can you try running your application
> with that?
>
> On 11 December 2015 at 03:08, Bonsen  wrote:
>
>> Thank you,and I find the problem is my package is test,but I write
>> package org.apache.spark.examples ,and IDEA had imported the
>> spark-examples-1.5.2-hadoop2.6.0.jar ,so I can run it,and it makes lots of
>> problems
>> __
>> Now , I change the package like this:
>>
>> package test
>> import org.apache.spark.SparkConf
>> import org.apache.spark.SparkContext
>> object test {
>>   def main(args: Array[String]) {
>> val conf = new
>> SparkConf().setAppName("mytest").setMaster("spark://Master:7077")
>> val sc = new SparkContext(conf)
>> sc.addJar("/home/hadoop/spark-assembly-1.5.2-hadoop2.6.0.jar")//It
>> doesn't work.!?
>> val rawData = sc.textFile("/home/hadoop/123.csv")
>> val secondData = rawData.flatMap(_.split(",").toString)
>> println(secondData.first)   /line 32
>> sc.stop()
>>   }
>> }
>> it causes that:
>> 15/12/11 18:41:06 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
>> 219.216.65.129): java.lang.ClassNotFoundException: test.test$$anonfun$1
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:348)
>> at
>> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
>> 
>> 
>> //  219.216.65.129 is my worker computer.
>> //  I can connect to my worker computer.
>> // Spark can start successfully.
>> //  addFile is also doesn't work,the tmp file will also dismiss.
>>
>>
>>
>>
>>
>>
>> At 2015-12-10 22:32:21, "Himanshu Mehra [via Apache Spark User List]" 
>> <[hidden
>> email] <http:///user/SendEmail.jtp?type=node&node=25689&i=0>> wrote:
>>
>> You are trying to print an array, but anyway it will print the objectID
>>  of the array if the input is same as you have shown here. Try flatMap()
>> instead of map and check if the problem is same.
>>
>>--Himanshu
>>
>> --
>> If you reply to this email, your message will be added to the discussion
>> below:
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/HELP-I-get-java-lang-String-cannot-be-cast-to-java-lang-Intege-for-a-long-time-tp25666p25667.html
>> To unsubscribe from HELP! I get "java.lang.String cannot be cast to
>> java.lang.Intege " for a long time., click here.
>> NAML
>> <http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>>
>>
>>
>>
>>
>> --
>> View this message in context: Re:Re: HELP! I get "java.lang.String
>> cannot be cast to java.lang.Intege " for a long time.
>> <http://apache-spark-user-list.1001560.n3.nabble.com/HELP-I-get-java-lang-String-cannot-be-cast-to-java-lang-Intege-for-a-long-time-tp25666p25689.html>
>>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>
>
>


Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-14 Thread Jakob Odersky
> Sorry,I'm late.I try again and again ,now I use spark 1.4.0 ,hadoop
2.4.1.but I also find something strange like this :

>
http://apache-spark-user-list.1001560.n3.nabble.com/worker-java-lang-ClassNotFoundException-ttt-test-anonfun-1-td25696.html
> (if i use "textFile",It can't run.)

In the link you sent, there is still an `addJar(spark-assembly-hadoop-xx)`,
can you try running your application with that?

On 11 December 2015 at 03:08, Bonsen  wrote:

> Thank you,and I find the problem is my package is test,but I write package
> org.apache.spark.examples ,and IDEA had imported the
> spark-examples-1.5.2-hadoop2.6.0.jar ,so I can run it,and it makes lots of
> problems
> __
> Now , I change the package like this:
>
> package test
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> object test {
>   def main(args: Array[String]) {
> val conf = new
> SparkConf().setAppName("mytest").setMaster("spark://Master:7077")
> val sc = new SparkContext(conf)
> sc.addJar("/home/hadoop/spark-assembly-1.5.2-hadoop2.6.0.jar")//It
> doesn't work.!?
> val rawData = sc.textFile("/home/hadoop/123.csv")
> val secondData = rawData.flatMap(_.split(",").toString)
> println(secondData.first)   /line 32
> sc.stop()
>   }
> }
> it causes that:
> 15/12/11 18:41:06 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> 219.216.65.129): java.lang.ClassNotFoundException: test.test$$anonfun$1
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
> 
> 
> //  219.216.65.129 is my worker computer.
> //  I can connect to my worker computer.
> // Spark can start successfully.
> //  addFile is also doesn't work,the tmp file will also dismiss.
>
>
>
>
>
>
> At 2015-12-10 22:32:21, "Himanshu Mehra [via Apache Spark User List]" <[hidden
> email] > wrote:
>
> You are trying to print an array, but anyway it will print the objectID
>  of the array if the input is same as you have shown here. Try flatMap()
> instead of map and check if the problem is same.
>
>--Himanshu
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/HELP-I-get-java-lang-String-cannot-be-cast-to-java-lang-Intege-for-a-long-time-tp25666p25667.html
> To unsubscribe from HELP! I get "java.lang.String cannot be cast to
> java.lang.Intege " for a long time., click here.
> NAML
> 
>
>
>
>
>
> --
> View this message in context: Re:Re: HELP! I get "java.lang.String cannot
> be cast to java.lang.Intege " for a long time.
> 
>
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-11 Thread Jakob Odersky
Btw, Spark 1.5 comes with support for hadoop 2.2 by default

On 11 December 2015 at 03:08, Bonsen  wrote:

> Thank you,and I find the problem is my package is test,but I write package
> org.apache.spark.examples ,and IDEA had imported the
> spark-examples-1.5.2-hadoop2.6.0.jar ,so I can run it,and it makes lots of
> problems
> __
> Now , I change the package like this:
>
> package test
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> object test {
>   def main(args: Array[String]) {
> val conf = new
> SparkConf().setAppName("mytest").setMaster("spark://Master:7077")
> val sc = new SparkContext(conf)
> sc.addJar("/home/hadoop/spark-assembly-1.5.2-hadoop2.6.0.jar")//It
> doesn't work.!?
> val rawData = sc.textFile("/home/hadoop/123.csv")
> val secondData = rawData.flatMap(_.split(",").toString)
> println(secondData.first)   /line 32
> sc.stop()
>   }
> }
> it causes that:
> 15/12/11 18:41:06 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> 219.216.65.129): java.lang.ClassNotFoundException: test.test$$anonfun$1
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
> 
> 
> //  219.216.65.129 is my worker computer.
> //  I can connect to my worker computer.
> // Spark can start successfully.
> //  addFile is also doesn't work,the tmp file will also dismiss.
>
>
>
>
>
>
> At 2015-12-10 22:32:21, "Himanshu Mehra [via Apache Spark User List]" <[hidden
> email] > wrote:
>
> You are trying to print an array, but anyway it will print the objectID
>  of the array if the input is same as you have shown here. Try flatMap()
> instead of map and check if the problem is same.
>
>--Himanshu
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/HELP-I-get-java-lang-String-cannot-be-cast-to-java-lang-Intege-for-a-long-time-tp25666p25667.html
> To unsubscribe from HELP! I get "java.lang.String cannot be cast to
> java.lang.Intege " for a long time., click here.
> NAML
> 
>
>
>
>
>
> --
> View this message in context: Re:Re: HELP! I get "java.lang.String cannot
> be cast to java.lang.Intege " for a long time.
> 
>
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-11 Thread Jakob Odersky
It looks like you have an issue with your classpath. I think it is because
you add a jar containing Spark twice: first, you have a dependency on Spark
somewhere in your build tool (this allows you to compile and run your
application); second, you re-add Spark here

>  sc.addJar("/home/hadoop/spark-assembly-1.5.2-hadoop2.6.0.jar")//It
doesn't work.!?

I recommend you remove that line and see if everything works.
If you have that line because you need Hadoop 2.6, I recommend you build
Spark against that version and publish it locally with Maven.
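
Something along these lines should do it (a sketch; adjust the profile and
Hadoop version to what you actually need):

# from a Spark source checkout: build against Hadoop 2.6 and install the
# artifacts into your local Maven repository
build/mvn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean install

Your project can then depend on that locally published version instead of
shipping the assembly jar via addJar.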


[jira] [Commented] (SPARK-12271) Improve error message for Dataset.as[] when the schema is incompatible.

2015-12-11 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053171#comment-15053171
 ] 

Jakob Odersky commented on SPARK-12271:
---

Ah, I didn't see there was already a fix for this

> Improve error message for Dataset.as[] when the schema is incompatible.
> ---
>
> Key: SPARK-12271
> URL: https://issues.apache.org/jira/browse/SPARK-12271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Nong Li
>
> It currently fails with an unexecutable exception.






[jira] [Issue Comment Deleted] (SPARK-12271) Improve error message for Dataset.as[] when the schema is incompatible.

2015-12-10 Thread Jakob Odersky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakob Odersky updated SPARK-12271:
--
Comment: was deleted

(was: I'm just starting out with Datasets, will be glad to try fixing this one)

> Improve error message for Dataset.as[] when the schema is incompatible.
> ---
>
> Key: SPARK-12271
> URL: https://issues.apache.org/jira/browse/SPARK-12271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Nong Li
>
> It currently fails with an unexecutable exception.






[jira] [Commented] (SPARK-12271) Improve error message for Dataset.as[] when the schema is incompatible.

2015-12-10 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15052017#comment-15052017
 ] 

Jakob Odersky commented on SPARK-12271:
---

What's the exact use-case and error you encounter?

> Improve error message for Dataset.as[] when the schema is incompatible.
> ---
>
> Key: SPARK-12271
> URL: https://issues.apache.org/jira/browse/SPARK-12271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Nong Li
>
> It currently fails with an unexecutable exception.






[jira] [Comment Edited] (SPARK-12271) Improve error message for Dataset.as[] when the schema is incompatible.

2015-12-10 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15052003#comment-15052003
 ] 

Jakob Odersky edited comment on SPARK-12271 at 12/11/15 1:57 AM:
-

I'm just starting out with Datasets, will be glad to try fixing this one


was (Author: jodersky):
I'm just starting out with DataSets, will be glad to try fixing this one

> Improve error message for Dataset.as[] when the schema is incompatible.
> ---
>
> Key: SPARK-12271
> URL: https://issues.apache.org/jira/browse/SPARK-12271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Nong Li
>
> It currently fails with an unexecutable exception.






[jira] [Commented] (SPARK-12271) Improve error message for Dataset.as[] when the schema is incompatible.

2015-12-10 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15052003#comment-15052003
 ] 

Jakob Odersky commented on SPARK-12271:
---

I'm just starting out with DataSets, will be glad to try fixing this one

> Improve error message for Dataset.as[] when the schema is incompatible.
> ---
>
> Key: SPARK-12271
> URL: https://issues.apache.org/jira/browse/SPARK-12271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Nong Li
>
> It currently fails with an unexecutable exception.






Re: Warning: Master endpoint spark://ip:7077 was not a REST server. Falling back to legacy submission gateway instead.

2015-12-10 Thread Jakob Odersky
Is there any other process using port 7077?
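
For example, something like

lsof -i :7077
# or
netstat -tlnp | grep 7077

run on the master host should show which process, if any, is already bound to
that port (the exact flags depend on your OS).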

On 10 December 2015 at 08:52, Andy Davidson 
wrote:

> Hi
>
> I am using spark-1.5.1-bin-hadoop2.6. Any idea why I get this warning. My
> job seems to run with out any problem.
>
> Kind regards
>
> Andy
>
> + /root/spark/bin/spark-submit --class
> com.pws.spark.streaming.IngestDriver --master spark://
> ec2-54-205-209-122.us-west-1.compute.amazonaws.com:7077
> --total-executor-cores 2 --deploy-mode cluster
> hdfs:///home/ec2-user/build/ingest-all.jar --clusterMode --dirPath week_3
>
> Running Spark using the REST application submission protocol.
>
> 15/12/10 16:46:33 WARN RestSubmissionClient: Unable to connect to server
> spark://ec2-54-205-209-122.us-west-1.compute.amazonaws.com:7077.
>
> Warning: Master endpoint
> ec2-54-205-209-122.us-west-1.compute.amazonaws.com:7077 was not a REST
> server. Falling back to legacy submission gateway instead.
>


Re: StackOverflowError when writing dataframe to table

2015-12-10 Thread Jakob Odersky
Can you give us some more info about the dataframe and caching? Ideally, a
set of steps to reproduce the issue.


On 9 December 2015 at 14:59, apu mishra . rr  wrote:

> The command
>
> mydataframe.write.saveAsTable(name="tablename")
>
> sometimes results in java.lang.StackOverflowError (see below for fuller
> error message).
>
> This is after I am able to successfully run cache() and show() methods on
> mydataframe.
>
> The issue is not deterministic, i.e. the same code sometimes works fine,
> sometimes not.
>
> I am running PySpark with:
>
> spark-submit --master local[*] --driver-memory 24g --executor-memory 24g
>
> Any help understanding this issue would be appreciated!
>
> Thanks, Apu
>
> Fuller error message:
>
> Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError
>
> at
> java.io.ObjectOutputStream$HandleTable.assign(ObjectOutputStream.java:2281)
>
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1428)
>
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>
> at
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>
> at
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>
> at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:497)
>
> at
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>
> at
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>
> at
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>
> at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:497)
>
> at
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStrea

Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-10 Thread Jakob Odersky
Could you provide some more context? What is rawData?

On 10 December 2015 at 06:38, Bonsen  wrote:

> I do like this "val secondData = rawData.flatMap(_.split("\t").take(3))"
>
> and I find:
> 15/12/10 22:36:55 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> 219.216.65.129): java.lang.ClassCastException: java.lang.String cannot be
> cast to java.lang.Integer
> at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
> at
> org.apache.spark.examples.SparkPi$$anonfun$1.apply(SparkPi.scala:32)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/HELP-I-get-java-lang-String-cannot-be-cast-to-java-lang-Intege-for-a-long-time-tp25666p25668.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


[jira] [Commented] (SPARK-9502) ArrayTypes incorrect for DataFrames Java API

2015-12-07 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15045984#comment-15045984
 ] 

Jakob Odersky commented on SPARK-9502:
--

Not sure if this is actually an error. As I see it, an ArrayType is something
"implementation specific": no assumptions should be made about the returned
collection type, except that it must be traversable.

> ArrayTypes incorrect for DataFrames Java API
> 
>
> Key: SPARK-9502
> URL: https://issues.apache.org/jira/browse/SPARK-9502
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Kuldeep
>Priority: Critical
>
> With the upgrade to 1.4.1, array types for DataFrames were different in our Java 
> applications. I have modified JavaApplySchemaSuite to show the problem. 
> Mainly I have added a list field to the Person class.
> {code:java}
>   public static class Person implements Serializable {
> private String name;
> private int age;
> private List<String> skills;
> public String getName() {
>   return name;
> }
> public void setName(String name) {
>   this.name = name;
> }
> public int getAge() {
>   return age;
> }
> public void setAge(int age) {
>   this.age = age;
> }
> public void setSkills(List<String> skills) {
>   this.skills = skills;
> }
> public List<String> getSkills() { return skills; }
>   }
>   @Test
>   public void applySchema() {
> List<Person> personList = new ArrayList<Person>(2);
> List<String> skills = new ArrayList<String>();
> skills.add("eating");
> skills.add("sleeping");
> Person person1 = new Person();
> person1.setName("Michael");
> person1.setAge(29);
> person1.setSkills(skills);
> personList.add(person1);
> Person person2 = new Person();
> person2.setName("Yin");
> person2.setAge(28);
> person2.setSkills(skills);
> personList.add(person2);
> JavaRDD<Row> rowRDD = javaCtx.parallelize(personList).map(
>   new Function<Person, Row>() {
> public Row call(Person person) throws Exception {
>   return RowFactory.create(person.getName(), person.getAge(), 
> person.getSkills());
> }
>   });
> List<StructField> fields = new ArrayList<StructField>(2);
> fields.add(DataTypes.createStructField("name", DataTypes.StringType, 
> false));
> fields.add(DataTypes.createStructField("age", DataTypes.IntegerType, 
> false));
> fields.add(DataTypes.createStructField("skills", 
> DataTypes.createArrayType(DataTypes.StringType), false));
> StructType schema = DataTypes.createStructType(fields);
> DataFrame df = sqlContext.applySchema(rowRDD, schema);
> df.registerTempTable("people");
> Row[] actual = sqlContext.sql("SELECT * FROM people").collect();
>   System.out.println(actual[1].get(2).getClass().getName());
>   System.out.println(actual[1].get(2) instanceof List);
> List<Row> expected = new ArrayList<Row>(2);
> expected.add(RowFactory.create("Michael", 29, skills));
> expected.add(RowFactory.create("Yin", 28, skills));
> Assert.assertEquals(expected, Arrays.asList(actual));
>   }
> {code}
> This prints 
> scala.collection.immutable.$colon$colon
> false
> java.lang.AssertionError: 
> Expected :[[Michael,29,[eating, sleeping]], [Yin,28,[eating, sleeping]]]
> Actual   :[[Michael,29,List(eating, sleeping)], [Yin,28,List(eating, 
> sleeping)]]
> Not sure if this would be usable even in scala.






Re: Fastest way to build Spark from scratch

2015-12-07 Thread Jakob Odersky
make-distribution and the second code snippet both create a distribution
from a clean state. They therefore require that every source file be
compiled, which takes time (you can maybe tweak some settings or use a
newer compiler to gain some speed).

I'm inferring from your question that, for your use case, deployment speed is
a critical issue and, furthermore, that you'd like to build Spark for many
(every?) commits in a systematic way. In that case I would suggest you try
using the second code snippet without the `clean` task, and only resort to
it if the build fails.
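
Concretely, that would just be the quoted snippet minus the `clean` (a sketch;
pass the same Hadoop profile flags you would give Maven, if you need them):

sbt/sbt assembly
sbt/sbt publish-local

Without `clean`, sbt can reuse the incremental compilation output from the
previous build instead of recompiling every source file.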

On my local machine, an assembly without a clean drops from 6 minutes to 2.

regards,
--Jakob



On 23 November 2015 at 20:18, Nicholas Chammas 
wrote:

> Say I want to build a complete Spark distribution against Hadoop 2.6+ as
> fast as possible from scratch.
>
> This is what I’m doing at the moment:
>
> ./make-distribution.sh -T 1C -Phadoop-2.6
>
> -T 1C instructs Maven to spin up 1 thread per available core. This takes
> around 20 minutes on an m3.large instance.
>
> I see that spark-ec2, on the other hand, builds Spark as follows
> 
> when you deploy Spark at a specific git commit:
>
> sbt/sbt clean assembly
> sbt/sbt publish-local
>
> This seems slower than using make-distribution.sh, actually.
>
> Is there a faster way to do this?
>
> Nick
> ​
>


Re: Help with type check

2015-11-30 Thread Jakob Odersky
Hi Eyal,

what you're seeing is not a Spark issue; it is related to boxed types.

I assume 'b' in your code is some kind of java buffer, where b.getDouble()
returns an instance of java.lang.Double and not a scala.Double. Hence
muCouch is an Array[java.lang.Double], an array containing boxed doubles.

To fix your problem, change 'yield b.getDouble(i)' to 'yield
b.getDouble(i).doubleValue'
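
For example, something along these lines (a standalone sketch, not your exact
code; `boxed` stands in for what the JSON accessor returns):

val boxed: Array[java.lang.Double] =
  Array(java.lang.Double.valueOf(1.0), java.lang.Double.valueOf(2.0))

// unbox each element explicitly, yielding scala.Double values
val unboxed: Array[Double] = boxed.map(_.doubleValue)

// new DenseVector(boxed)   // would not compile: found Array[java.lang.Double]
// new DenseVector(unboxed) // compiles: DenseVector expects Array[scala.Double]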

You might want to have a look at these too:
-
http://stackoverflow.com/questions/23821576/efficient-conversion-of-java-util-listjava-lang-double-to-scala-listdouble
- https://docs.oracle.com/javase/7/docs/api/java/lang/Double.html
- http://www.scala-lang.org/api/current/index.html#scala.Double

On 30 November 2015 at 10:13, Eyal Sharon  wrote:

> Hi ,
>
> I have problem with inferring what are the types bug here
>
> I have this code fragment . it parse Json to Array[Double]
>
>
>
>
>
>
> val muCouch = {
>   val e = input.filter( _.id=="mu")(0).content()
>   val b  = e.getArray("feature_mean")
>   for (i <- 0 to e.getInt("features") ) yield b.getDouble(i)
> }.toArray
>
> Now the problem is when I want to create a dense vector  :
>
> new DenseVector(muCouch)
>
>
> I get the following error :
>
>
> Error:(111, 21) type mismatch;
>  found   : Array[java.lang.Double]
>  required: Array[scala.Double]
>
>
> Now , I probably get a workaround for that , but I want to get a deeper 
> understanding  on why it occurs
>
> p.s - I do use collection.JavaConversions._
>
> Thanks !
>
>
>
>
> *This email and any files transmitted with it are confidential and
> intended solely for the use of the individual or entity to whom they are
> addressed. Please note that any disclosure, copying or distribution of the
> content of this information is strictly forbidden. If you have received
> this email message in error, please destroy it immediately and notify its
> sender.*
>


Datasets on experimental dataframes?

2015-11-23 Thread Jakob Odersky
Hi,

Datasets are being built upon the experimental DataFrame API; does this
mean DataFrames won't be experimental in the near future?

thanks,
--Jakob


Re: Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Jakob Odersky
Thanks Michael, that helped me a lot!

On 23 November 2015 at 17:47, Michael Armbrust 
wrote:

> Here is how I view the relationship between the various components of
> Spark:
>
>  - *RDDs - *a low level API for expressing DAGs that will be executed in
> parallel by Spark workers
>  - *Catalyst -* an internal library for expressing trees that we use to
> build relational algebra and expression evaluation.  There's also an
> optimizer and query planner that turns these logical concepts into RDD
> actions.
>  - *Tungsten -* an internal optimized execution engine that can compile
> catalyst expressions into efficient java bytecode that operates directly on
> serialized binary data.  It also has nice low level data structures /
> algorithms like hash tables and sorting that operate directly on serialized
> data.  These are used by the physical nodes that are produced by the query
> planner (and run inside of RDD operations on workers).
>  - *DataFrames - *a user facing API that is similar to SQL/LINQ for
> constructing dataflows that are backed by catalyst logical plans
>  - *Datasets* - a user facing API that is similar to the RDD API for
> constructing dataflows that are backed by catalyst logical plans
>
> So everything is still operating on RDDs but I anticipate most users will
> eventually migrate to the higher level APIs for convenience and automatic
> optimization
>
> On Mon, Nov 23, 2015 at 4:18 PM, Jakob Odersky  wrote:
>
>> Hi everyone,
>>
>> I'm doing some reading-up on all the newer features of Spark such as
>> DataFrames, DataSets and Project Tungsten. This got me a bit confused on
>> the relation between all these concepts.
>>
>> When starting to learn Spark, I read a book and the original paper on
>> RDDs; this led me to basically think "Spark == RDDs".
>> Now, looking into DataFrames, I read that they are basically
>> (distributed) collections with an associated schema, thus enabling
>> declarative queries and optimization (through Catalyst). I am uncertain how
>> DataFrames relate to RDDs: are DataFrames transformed to operations on RDDs
>> once they have been optimized? Or are they completely different concepts?
>> In case of the latter, do DataFrames still use the Spark scheduler and get
>> broken down into a DAG of stages and tasks?
>>
>> Regarding project Tungsten, where does it fit in? To my understanding it
>> is used to efficiently cache data in memory and may also be used to
>> generate query code for specialized hardware. This sounds as though it
>> would work on Spark's worker nodes, however it would also only work with
>> schema-associated data (aka DataFrames), thus leading me to the conclusion
>> that RDDs and DataFrames do not share a common backend which in turn
>> contradicts my conception of "Spark == RDDs".
>>
>> Maybe I missed the obvious as these questions seem pretty basic, however
>> I was unable to find clear answers in Spark documentation or related papers
>> and talks. I would greatly appreciate any clarifications.
>>
>> thanks,
>> --Jakob
>>
>
>


Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Jakob Odersky
Hi everyone,

I'm doing some reading-up on all the newer features of Spark such as
DataFrames, DataSets and Project Tungsten. This got me a bit confused on
the relation between all these concepts.

When starting to learn Spark, I read a book and the original paper on RDDs;
this led me to basically think "Spark == RDDs".
Now, looking into DataFrames, I read that they are basically (distributed)
collections with an associated schema, thus enabling declarative queries
and optimization (through Catalyst). I am uncertain how DataFrames relate
to RDDs: are DataFrames transformed to operations on RDDs once they have
been optimized? Or are they completely different concepts? In case of the
latter, do DataFrames still use the Spark scheduler and get broken down
into a DAG of stages and tasks?

Regarding project Tungsten, where does it fit in? To my understanding it is
used to efficiently cache data in memory and may also be used to generate
query code for specialized hardware. This sounds as though it would work on
Spark's worker nodes, however it would also only work with
schema-associated data (aka DataFrames), thus leading me to the conclusion
that RDDs and DataFrames do not share a common backend which in turn
contradicts my conception of "Spark == RDDs".

Maybe I missed the obvious as these questions seem pretty basic, however I
was unable to find clear answers in Spark documentation or related papers
and talks. I would greatly appreciate any clarifications.

thanks,
--Jakob


[jira] [Commented] (SPARK-7286) Precedence of operator not behaving properly

2015-11-20 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15020244#comment-15020244
 ] 

Jakob Odersky commented on SPARK-7286:
--

Hey @xiao, I think we had a misunderstanding: =!= solves all the problems.
I was suggesting that <> would not solve all problems, but that =!= doesn't
"look" as good (my subjective opinion ;))



> Precedence of operator not behaving properly
> 
>
> Key: SPARK-7286
> URL: https://issues.apache.org/jira/browse/SPARK-7286
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: Linux
>Reporter: DevilJetha
>Priority: Critical
>
> The precedence of the operators ( especially with !== and && ) in Dataframe 
> Columns seems to be messed up.
> Example Snippet
> .where( $"col1" === "val1" && ($"col2"  !== "val2")  ) works fine.
> whereas .where( $"col1" === "val1" && $"col2"  !== "val2"  )
> evaluates as ( $"col1" === "val1" && $"col2" ) !== "val2"






Re: Blocked REPL commands

2015-11-19 Thread Jakob Odersky
That definitely looks like a bug, go ahead and file an issue.
I'll check the Scala REPL source code to see what other commands, if any,
should be disabled.

On 19 November 2015 at 12:54, Jacek Laskowski  wrote:

> Hi,
>
> Dunno the answer, but :reset should be blocked, too, for obvious reasons.
>
> ➜  spark git:(master) ✗ ./bin/spark-shell
> ...
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.6.0-SNAPSHOT
>   /_/
>
> Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java
> 1.8.0_66)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> :reset
> Resetting interpreter state.
> Forgetting this session history:
>
>
>  @transient val sc = {
>val _sc = org.apache.spark.repl.Main.createSparkContext()
>println("Spark context available as sc.")
>_sc
>  }
>
>
>  @transient val sqlContext = {
>val _sqlContext = org.apache.spark.repl.Main.createSQLContext()
>println("SQL context available as sqlContext.")
>_sqlContext
>  }
>
> import org.apache.spark.SparkContext._
> import sqlContext.implicits._
> import sqlContext.sql
> import org.apache.spark.sql.functions._
> ...
>
> scala> import org.apache.spark._
> import org.apache.spark._
>
> scala> val sc = new SparkContext("local[*]", "shell", new SparkConf)
> ...
> org.apache.spark.SparkException: Only one SparkContext may be running
> in this JVM (see SPARK-2243). To ignore this error, set
> spark.driver.allowMultipleContexts = true. The currently running
> SparkContext was created at:
> org.apache.spark.SparkContext.(SparkContext.scala:82)
> ...
>
> Guess I should file an issue?
>
> Pozdrawiam,
> Jacek
>
> --
> Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
> http://blog.jaceklaskowski.pl
> Mastering Apache Spark
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>
>
> On Thu, Nov 19, 2015 at 8:44 PM, Jakob Odersky  wrote:
> > I was just going through the spark shell code and saw this:
> >
> > private val blockedCommands = Set("implicits", "javap", "power",
> "type",
> > "kind")
> >
> > What is the reason as to why these commands are blocked?
> >
> > thanks,
> > --Jakob
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


[jira] [Comment Edited] (SPARK-7286) Precedence of operator not behaving properly

2015-11-19 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014647#comment-15014647
 ] 

Jakob Odersky edited comment on SPARK-7286 at 11/19/15 11:00 PM:
-

I just realized that <> would also have a different precedence than ===

That pretty much limits our options to 1 or 3

Thinking about option 1 again, it might actually not be that bad since 
(in-)equality comparisons typically happen between expressions using higher 
precedence characters (*/+-:%). If the dollar column operator were removed it 
would become a very rare situation.

On a side note, why was the dollar character actually used? From what I know 
its use is discouraged as it can clash with compiler generated identifiers


was (Author: jodersky):
I just realized that <> would also have a different precedence than ===

That pretty much limits our options to 1 or 3

> Precedence of operator not behaving properly
> 
>
> Key: SPARK-7286
> URL: https://issues.apache.org/jira/browse/SPARK-7286
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: Linux
>Reporter: DevilJetha
>Priority: Critical
>
> The precedence of the operators ( especially with !== and && ) in Dataframe 
> Columns seems to be messed up.
> Example Snippet
> .where( $"col1" === "val1" && ($"col2"  !== "val2")  ) works fine.
> whereas .where( $"col1" === "val1" && $"col2"  !== "val2"  )
> evaluates as ( $"col1" === "val1" && $"col2" ) !== "val2"






[jira] [Comment Edited] (SPARK-7286) Precedence of operator not behaving properly

2015-11-19 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014647#comment-15014647
 ] 

Jakob Odersky edited comment on SPARK-7286 at 11/19/15 11:01 PM:
-

I just realized that <> would also have a different precedence than ===

That pretty much limits our options to 1 or 3

Thinking about option 1 again, it might actually not be that bad since 
(in-)equality comparisons typically happen between expressions using higher 
precedence characters (*/+-:%). If the dollar column operator were removed it 
would become a very rare situation.

On a side note, why was the dollar character actually chosen? From what I know 
its use is discouraged as it can clash with compiler generated identifiers


was (Author: jodersky):
I just realized that <> would also have a different precedence than ===

That pretty much limits our options to 1 or 3

Thinking about option 1 again, it might actually not be that bad since 
(in-)equality comparisons typically happen between expressions using higher 
precedence characters (*/+-:%). If the dollar column operator were removed it 
would become a very rare situation.

On a side note, why was the dollar character actually used? From what I know 
its use is discouraged as it can clash with compiler generated identifiers

> Precedence of operator not behaving properly
> 
>
> Key: SPARK-7286
> URL: https://issues.apache.org/jira/browse/SPARK-7286
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: Linux
>Reporter: DevilJetha
>Priority: Critical
>
> The precedence of the operators ( especially with !== and && ) in Dataframe 
> Columns seems to be messed up.
> Example Snippet
> .where( $"col1" === "val1" && ($"col2"  !== "val2")  ) works fine.
> whereas .where( $"col1" === "val1" && $"col2"  !== "val2"  )
> evaluates as ( $"col1" === "val1" && $"col2" ) !== "val2"






[jira] [Comment Edited] (SPARK-7286) Precedence of operator not behaving properly

2015-11-19 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014647#comment-15014647
 ] 

Jakob Odersky edited comment on SPARK-7286 at 11/19/15 10:52 PM:
-

I just realized that <> would also have a different precedence than ===

That pretty much limits our options to 1 or 3


was (Author: jodersky):
I just realized that <> would also have a different precedence than ===

> Precedence of operator not behaving properly
> 
>
> Key: SPARK-7286
> URL: https://issues.apache.org/jira/browse/SPARK-7286
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: Linux
>Reporter: DevilJetha
>Priority: Critical
>
> The precedence of the operators ( especially with !== and && ) in Dataframe 
> Columns seems to be messed up.
> Example Snippet
> .where( $"col1" === "val1" && ($"col2"  !== "val2")  ) works fine.
> whereas .where( $"col1" === "val1" && $"col2"  !== "val2"  )
> evaluates as ( $"col1" === "val1" && $"col2" ) !== "val2"






[jira] [Commented] (SPARK-7286) Precedence of operator not behaving properly

2015-11-19 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014647#comment-15014647
 ] 

Jakob Odersky commented on SPARK-7286:
--

I just realized that <> would also have a different precedence than ===.

> Precedence of operator not behaving properly
> 
>
> Key: SPARK-7286
> URL: https://issues.apache.org/jira/browse/SPARK-7286
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: Linux
>Reporter: DevilJetha
>Priority: Critical
>
> The precedence of the operators ( especially with !== and && ) in Dataframe 
> Columns seems to be messed up.
> Example Snippet
> .where( $"col1" === "val1" && ($"col2"  !== "val2")  ) works fine.
> whereas .where( $"col1" === "val1" && $"col2"  !== "val2"  )
> evaluates as ( $"col1" === "val1" && $"col2" ) !== "val2"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7286) Precedence of operator not behaving properly

2015-11-19 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014613#comment-15014613
 ] 

Jakob Odersky commented on SPARK-7286:
--

If we want the equals and non-equals operators to have the same precedence, they
should both start with a symbol of the same precedence.
= has the same precedence as !; however, as per the spec, if an operator also
ends with = and does not start with =, it is considered an assignment operator,
with lower precedence than all other operators.

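A tiny stand-alone sketch of the rule (plain Scala, no Spark needed; {{Expr}} is a made-up stand-in for {{Column}}, and =!= is just the alternative spelling suggested earlier in this thread):
{code}
class Expr(val repr: String) {
  def ===(o: Expr) = new Expr(s"(${repr} === ${o.repr})")
  def !==(o: Expr) = new Expr(s"(${repr} !== ${o.repr})")  // ends with '=', does not start with '=': assignment precedence
  def =!=(o: Expr) = new Expr(s"(${repr} =!= ${o.repr})")  // starts with '=': ordinary comparison precedence
  def &&(o: Expr)  = new Expr(s"(${repr} && ${o.repr})")
}
val (a, b, c, d) = (new Expr("a"), new Expr("b"), new Expr("c"), new Expr("d"))

(a === b && c !== d).repr  // "(((a === b) && c) !== d)" -- the surprising grouping
(a === b && c =!= d).repr  // "((a === b) && (c =!= d))" -- the intended grouping
{code}
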
So I see a couple of options:
1) abandon the wish for same precedence and accept the status quo
2) adopt Reynold's solution or define another operator
3) remove the non-equals comparison altogether and instead rely on negating the
equals comparison, i.e. A !== B becomes !(A === B)

Personally, I am against choosing option 1, as it leads to confusing behaviour
(as this JIRA demonstrates). I'm not sure which I prefer between options 2 and 3,
though.

> Precedence of operator not behaving properly
> 
>
> Key: SPARK-7286
> URL: https://issues.apache.org/jira/browse/SPARK-7286
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: Linux
>Reporter: DevilJetha
>Priority: Critical
>
> The precedence of the operators ( especially with !== and && ) in Dataframe 
> Columns seems to be messed up.
> Example Snippet
> .where( $"col1" === "val1" && ($"col2"  !== "val2")  ) works fine.
> whereas .where( $"col1" === "val1" && $"col2"  !== "val2"  )
> evaluates as ( $"col1" === "val1" && $"col2" ) !== "val2"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Blocked REPL commands

2015-11-19 Thread Jakob Odersky
I was just going through the spark shell code and saw this:

private val blockedCommands = Set("implicits", "javap", "power",
"type", "kind")

What is the reason as to why these commands are blocked?

thanks,
--Jakob


[jira] [Commented] (SPARK-7286) Precedence of operator not behaving properly

2015-11-18 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012643#comment-15012643
 ] 

Jakob Odersky commented on SPARK-7286:
--

Going through the code, I saw that Catalyst also defines !== in its DSL, so it
seems this operator has quite widespread usage.
Would deprecating it in favor of something else be a viable option?

> Precedence of operator not behaving properly
> 
>
> Key: SPARK-7286
> URL: https://issues.apache.org/jira/browse/SPARK-7286
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: Linux
>Reporter: DevilJetha
>Priority: Critical
>
> The precedence of the operators ( especially with !== and && ) in Dataframe 
> Columns seems to be messed up.
> Example Snippet
> .where( $"col1" === "val1" && ($"col2"  !== "val2")  ) works fine.
> whereas .where( $"col1" === "val1" && $"col2"  !== "val2"  )
> evaluates as ( $"col1" === "val1" && $"col2" ) !== "val2"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7286) Precedence of operator not behaving properly

2015-11-18 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012532#comment-15012532
 ] 

Jakob Odersky edited comment on SPARK-7286 at 11/19/15 1:16 AM:


The problem is that !== is recognized as an assignment operator (according to 
§6.12.4 of the scala specification 
http://www.scala-lang.org/docu/files/ScalaReference.pdf) and thus has lower 
precedence than any other operator.
A potential fix could be to rename !== to
 =!=


was (Author: jodersky):
The problem is that !== is recognized as an assignment operator (according to 
§6.12.4 of the scala specification 
http://www.scala-lang.org/docu/files/ScalaReference.pdf) and thus has lower 
precedence than any other operator.
A potential fix could be to rename !== to =!=

> Precedence of operator not behaving properly
> 
>
> Key: SPARK-7286
> URL: https://issues.apache.org/jira/browse/SPARK-7286
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: Linux
>Reporter: DevilJetha
>Priority: Critical
>
> The precedence of the operators ( especially with !== and && ) in Dataframe 
> Columns seems to be messed up.
> Example Snippet
> .where( $"col1" === "val1" && ($"col2"  !== "val2")  ) works fine.
> whereas .where( $"col1" === "val1" && $"col2"  !== "val2"  )
> evaluates as ( $"col1" === "val1" && $"col2" ) !== "val2"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7286) Precedence of operator not behaving properly

2015-11-18 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012532#comment-15012532
 ] 

Jakob Odersky commented on SPARK-7286:
--

The problem is that !== is recognized as an assignment operator (according to 
§6.12.4 of the scala specification 
http://www.scala-lang.org/docu/files/ScalaReference.pdf) and thus has lower 
precedence than any other operator.
A potential fix could be to rename !== to =!=

> Precedence of operator not behaving properly
> 
>
> Key: SPARK-7286
> URL: https://issues.apache.org/jira/browse/SPARK-7286
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: Linux
>Reporter: DevilJetha
>Priority: Critical
>
> The precedence of the operators ( especially with !== and && ) in Dataframe 
> Columns seems to be messed up.
> Example Snippet
> .where( $"col1" === "val1" && ($"col2"  !== "val2")  ) works fine.
> whereas .where( $"col1" === "val1" && $"col2"  !== "val2"  )
> evaluates as ( $"col1" === "val1" && $"col2" ) !== "val2"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11832) Spark shell does not work from sbt with scala 2.11

2015-11-18 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012372#comment-15012372
 ] 

Jakob Odersky commented on SPARK-11832:
---

I'm on to something: it seems as though sbt actually passes a '-usejavacp'
argument to the repl, but the Spark shell's Scala 2.11 implementation ignores
all arguments. I'm working on a fix.

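A rough sketch of the direction such a fix could take (hypothetical, the actual patch may look different): build the interpreter settings from the arguments instead of dropping them.
{code}
import scala.tools.nsc.Settings

// `interpreterArgs` stands in for whatever arguments sbt hands to the repl
def makeSettings(interpreterArgs: List[String]): Settings = {
  val s = new Settings()
  s.processArguments(interpreterArgs, processAll = true)  // honours flags such as -usejavacp
  s
}
{code}
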
> Spark shell does not work from sbt with scala 2.11
> --
>
> Key: SPARK-11832
> URL: https://issues.apache.org/jira/browse/SPARK-11832
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Reporter: Jakob Odersky
>Priority: Minor
>
> Using Scala 2.11, running the spark shell task from within sbt fails, however 
> running it from a distribution works.
> h3. Steps to reproduce
> # change scala version {{dev/change-scala-version.sh 2.11}}
> # start sbt {{build/sbt -Dscala-2.11}}
> # run shell task {{sparkShell}}
> h3. Stacktrace
> {code}
> Failed to initialize compiler: object scala in compiler mirror not found.
> ** Note that as of 2.8 scala does not assume use of the java classpath.
> ** For the old behavior pass -usejavacp to scala, or if using a Settings
> ** object programmatically, settings.usejavacp.value = true.
> Exception in thread "main" java.lang.NullPointerException
>   at 
> scala.reflect.internal.SymbolTable.exitingPhase(SymbolTable.scala:256)
>   at 
> scala.tools.nsc.interpreter.IMain$Request.x$20$lzycompute(IMain.scala:894)
>   at scala.tools.nsc.interpreter.IMain$Request.x$20(IMain.scala:893)
>   at 
> scala.tools.nsc.interpreter.IMain$Request.importsPreamble$lzycompute(IMain.scala:893)
>   at 
> scala.tools.nsc.interpreter.IMain$Request.importsPreamble(IMain.scala:893)
>   at 
> scala.tools.nsc.interpreter.IMain$Request$Wrapper.preamble(IMain.scala:915)
>   at 
> scala.tools.nsc.interpreter.IMain$CodeAssembler$$anonfun$apply$23.apply(IMain.scala:1325)
>   at 
> scala.tools.nsc.interpreter.IMain$CodeAssembler$$anonfun$apply$23.apply(IMain.scala:1324)
>   at scala.tools.nsc.util.package$.stringFromWriter(package.scala:64)
>   at 
> scala.tools.nsc.interpreter.IMain$CodeAssembler$class.apply(IMain.scala:1324)
>   at 
> scala.tools.nsc.interpreter.IMain$Request$Wrapper.apply(IMain.scala:906)
>   at 
> scala.tools.nsc.interpreter.IMain$Request.compile$lzycompute(IMain.scala:995)
>   at scala.tools.nsc.interpreter.IMain$Request.compile(IMain.scala:990)
>   at scala.tools.nsc.interpreter.IMain.compile(IMain.scala:577)
>   at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
>   at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:563)
>   at scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:802)
>   at 
> scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:836)
>   at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:694)
>   at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:404)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcZ$sp(SparkILoop.scala:39)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:38)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:38)
>   at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:213)
>   at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:38)
>   at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94)
>   at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:922)
>   at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
>   at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
>   at 
> scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
>   at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:911)
>   at org.apache.spark.repl.Main$.main(Main.scala:49)
>   at org.apache.spark.repl.Main.main(Main.scala)
> {code}
> h3. Workaround
> In {{repl/scala-2.11/src/main/scala/org/apache/spark/repl/Main.scala}}, 
> append to the repl settings {{s.usejavacp.value = true}}
> I haven't looked into the details of {{scala.tools.nsc.Settings}}, maybe 
> someone has an idea of what's going on.
> Also, to be clear, this bug only affects scala 2.11 from within sbt; calling 
> spark-shell from a distribution or from anywhere using scala 2.10 works.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11832) Spark shell does not work from sbt with scala 2.11

2015-11-18 Thread Jakob Odersky (JIRA)
Jakob Odersky created SPARK-11832:
-

 Summary: Spark shell does not work from sbt with scala 2.11
 Key: SPARK-11832
 URL: https://issues.apache.org/jira/browse/SPARK-11832
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Reporter: Jakob Odersky
Priority: Minor


Using Scala 2.11, running the spark shell task from within sbt fails, however 
running it from a distribution works.

h3. Steps to reproduce
# change scala version {{dev/change-scala-version.sh 2.11}}
# start sbt {{build/sbt -Dscala-2.11}}
# run shell task {{sparkShell}}

h3. Stacktrace
{code}
Failed to initialize compiler: object scala in compiler mirror not found.
** Note that as of 2.8 scala does not assume use of the java classpath.
** For the old behavior pass -usejavacp to scala, or if using a Settings
** object programmatically, settings.usejavacp.value = true.
Exception in thread "main" java.lang.NullPointerException
at 
scala.reflect.internal.SymbolTable.exitingPhase(SymbolTable.scala:256)
at 
scala.tools.nsc.interpreter.IMain$Request.x$20$lzycompute(IMain.scala:894)
at scala.tools.nsc.interpreter.IMain$Request.x$20(IMain.scala:893)
at 
scala.tools.nsc.interpreter.IMain$Request.importsPreamble$lzycompute(IMain.scala:893)
at 
scala.tools.nsc.interpreter.IMain$Request.importsPreamble(IMain.scala:893)
at 
scala.tools.nsc.interpreter.IMain$Request$Wrapper.preamble(IMain.scala:915)
at 
scala.tools.nsc.interpreter.IMain$CodeAssembler$$anonfun$apply$23.apply(IMain.scala:1325)
at 
scala.tools.nsc.interpreter.IMain$CodeAssembler$$anonfun$apply$23.apply(IMain.scala:1324)
at scala.tools.nsc.util.package$.stringFromWriter(package.scala:64)
at 
scala.tools.nsc.interpreter.IMain$CodeAssembler$class.apply(IMain.scala:1324)
at 
scala.tools.nsc.interpreter.IMain$Request$Wrapper.apply(IMain.scala:906)
at 
scala.tools.nsc.interpreter.IMain$Request.compile$lzycompute(IMain.scala:995)
at scala.tools.nsc.interpreter.IMain$Request.compile(IMain.scala:990)
at scala.tools.nsc.interpreter.IMain.compile(IMain.scala:577)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:563)
at scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:802)
at 
scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:836)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:694)
at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:404)
at 
org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcZ$sp(SparkILoop.scala:39)
at 
org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:38)
at 
org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:38)
at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:213)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:38)
at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:922)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
at 
scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:911)
at org.apache.spark.repl.Main$.main(Main.scala:49)
at org.apache.spark.repl.Main.main(Main.scala)
{code}

h3. Workaround
In {{repl/scala-2.11/src/main/scala/org/apache/spark/repl/Main.scala}}, append 
to the repl settings {{s.usejavacp.value = true}}

I haven't looked into the details of {{scala.tools.nsc.Settings}}, maybe 
someone has an idea of what's going on.
Also, to be clear, this bug only affects scala 2.11 from within sbt; calling 
spark-shell from a distribution or from anywhere using scala 2.10 works.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11288) Specify the return type for UDF in Scala

2015-11-18 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15011855#comment-15011855
 ] 

Jakob Odersky commented on SPARK-11288:
---

Tell me if I'm missing the point, but you can also pass the type parameters
explicitly when defining a UDF:
{code}
udf[ReturnType, Arg1, Arg2]((arg1, arg2) => ret)
{code}
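
For example (a sketch: assumes a DataFrame {{df}} with a numeric column "amount"; the names and types are made up):
{code}
import org.apache.spark.sql.functions.udf

// the return type is pinned explicitly via the type parameters instead of being inferred
val toCents = udf[Long, Double]((amount: Double) => math.round(amount * 100))
val withCents = df.withColumn("cents", toCents(df("amount")))
{code}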

> Specify the return type for UDF in Scala
> 
>
> Key: SPARK-11288
> URL: https://issues.apache.org/jira/browse/SPARK-11288
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> The return type is figured out from the function signature, maybe it's not 
> that user want, for example, the default DecimalType is (38, 18), user may 
> want (38, 0).
> The older deprecated one callUDF can do that, we should figure out  a way to 
> support that.
> cc [~marmbrus]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9875) Do not evaluate foreach and foreachPartition with count

2015-11-17 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15009891#comment-15009891
 ] 

Jakob Odersky commented on SPARK-9875:
--

I'm not sure I understand the issue. Are you trying to force running {{func}} 
on every partition to achieve some kind of side effect?

> Do not evaluate foreach and foreachPartition with count
> ---
>
> Key: SPARK-9875
> URL: https://issues.apache.org/jira/browse/SPARK-9875
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Zoltán Zvara
>Priority: Minor
>
> It is evident, that the summation inside count will result in an overhead, 
> which would be nice to remove from the current execution.
> {{self.mapPartitions(func).count()  # Force evaluation}} @{{rdd.py}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11765) Avoid assign UI port between browser unsafe ports (or just 4045: lockd)

2015-11-16 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15007740#comment-15007740
 ] 

Jakob Odersky commented on SPARK-11765:
---

I think adding a "blacklist" of ports could lead to confusing debugging 
experiences with no real value gained.
You can always explicitly set the web UI's port with the configuration 
parameter {{spark.ui.port}}, as explained here 
http://spark.apache.org/docs/latest/configuration.html
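
For example (the port value is just an illustration):
{code}
// pin the web UI port explicitly instead of relying on 4040 + increments
val conf = new org.apache.spark.SparkConf().set("spark.ui.port", "4080")
val sc = new org.apache.spark.SparkContext(conf)
{code}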

> Avoid assign UI port between browser unsafe ports (or just 4045: lockd)
> ---
>
> Key: SPARK-11765
> URL: https://issues.apache.org/jira/browse/SPARK-11765
> Project: Spark
>  Issue Type: Improvement
>Reporter: Jungtaek Lim
>Priority: Minor
>
> Spark UI port starts on 4040, and UI port is incremented by 1 for every 
> confliction.
> In our use case, we have some drivers running at the same time, which makes 
> UI port to be assigned to 4045, which is treated to unsafe port for chrome 
> and mozilla.
> http://src.chromium.org/viewvc/chrome/trunk/src/net/base/net_util.cc?view=markup
> http://www-archive.mozilla.org/projects/netlib/PortBanning.html#portlist
> We would like to avoid assigning UI to these ports, or just avoid assigning 
> UI port to 4045 which is too close to default port.
> If we'd like to accept this idea, I'm happy to work on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11688) UDF's doesn't work when it has a default arguments

2015-11-12 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003603#comment-15003603
 ] 

Jakob Odersky commented on SPARK-11688:
---

As a workaround, you could register your function twice:

{{sqlContext.udf.register("hasSubstring", (x: String, y: String) => hasSubstring(x, y)) // this will use the default argument}}
{{sqlContext.udf.register("hasSubstringIndex", (x: String, y: String, z: Int) => hasSubstring(x, y, z))}}

and call them accordingly.

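The calls would then look something like this (a sketch, following the issue's example; assumes {{import org.apache.spark.sql.functions._}} for {{callUDF}} and {{lit}}):
{code}
df.withColumn("subStringIndex", callUDF("hasSubstring", df("c1"), lit("la")))
df.withColumn("subStringIndexFrom", callUDF("hasSubstringIndex", df("c1"), lit("la"), lit(1)))
{code}
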
> UDF's doesn't work when it has a default arguments
> --
>
> Key: SPARK-11688
> URL: https://issues.apache.org/jira/browse/SPARK-11688
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: M Bharat lal
>Priority: Minor
>
> Use case:
> 
> Suppose we have a function which accepts three parameters (string, subString 
> and frmIndex which has 0 default value )
> def hasSubstring(string:String, subString:String, frmIndex:Int = 0): Long = 
> string.indexOf(subString, frmIndex)
> above function works perfectly if I dont pass frmIndex parameter
> scala> hasSubstring("Scala", "la")
> res0: Long = 3
> But, when I register the above function as UDF (successfully registered) and 
> call the same without  passing frmIndex parameter got the below exception
> scala> val df  = 
> sqlContext.createDataFrame(Seq(("scala","Spark","MLlib"),("abc", "def", 
> "gfh"))).toDF("c1", "c2", "c3")
> df: org.apache.spark.sql.DataFrame = [c1: string, c2: string, c3: string]
> scala> df.show
> +-+-+-+
> |   c1|   c2|   c3|
> +-+-+-+
> |scala|Spark|MLlib|
> |  abc|  def|  gfh|
> +-+-+-+
> scala> sqlContext.udf.register("hasSubstring", hasSubstring _ )
> res3: org.apache.spark.sql.UserDefinedFunction = 
> UserDefinedFunction(,LongType,List())
> scala> val result = df.as("i0").withColumn("subStringIndex", 
> callUDF("hasSubstring", $"i0.c1", lit("la")))
> org.apache.spark.sql.AnalysisException: undefined function hasSubstring;
>   at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
>   at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57)
>   at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53)
>   at scala.util.Try.getOrElse(Try.scala:77)
>   at 
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$23.apply(Analyzer.scala:490)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$23.apply(Analyzer.scala:490)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11688) UDF's doesn't work when it has a default arguments

2015-11-12 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003600#comment-15003600
 ] 

Jakob Odersky commented on SPARK-11688:
---

Registering a UDF requires a function (an instance of FunctionN); however, only
defs support default parameters. Let me illustrate:

{{hasSubstring _}} is equivalent to {{(x: String, y: String, z: Int) => hasSubstring(x, y, z)}}, which is only syntactic sugar for

{code}
// note that the result type comes last in FunctionN's type parameters
new Function3[String, String, Int, Long] {
  def apply(x: String, y: String, z: Int) = hasSubstring(x, y, z)
}
{code}

Therefore, the error is expected, since you are trying to call a Function3 with
only two parameters.

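A quick stand-alone illustration (plain Scala, following the issue's {{hasSubstring}} example):
{code}
def hasSubstring(string: String, subString: String, frmIndex: Int = 0): Long =
  string.indexOf(subString, frmIndex)

val f = hasSubstring _   // eta-expansion to a Function3; the default argument is lost
f("Scala", "la", 0)      // OK: all three arguments supplied
// f("Scala", "la")      // does not compile: Function3 has no two-argument apply
{code}
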
With the current API, and without trying some macro magic, I see no way of
enabling default parameters for UDFs. Maybe changing the register API to
something like register((X, Y, Z) => R, defaults) could work, where defaults
would supply the arguments for any unspecified parameters when the UDF is
called. However, this could also lead to some very subtle errors: any
substituted default parameter would have the value specified during
registration, which is potentially different from the default parameter
specified in the corresponding def declaration.

> UDF's doesn't work when it has a default arguments
> --
>
> Key: SPARK-11688
> URL: https://issues.apache.org/jira/browse/SPARK-11688
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: M Bharat lal
>Priority: Minor
>
> Use case:
> 
> Suppose we have a function which accepts three parameters (string, subString 
> and frmIndex which has 0 default value )
> def hasSubstring(string:String, subString:String, frmIndex:Int = 0): Long = 
> string.indexOf(subString, frmIndex)
> above function works perfectly if I dont pass frmIndex parameter
> scala> hasSubstring("Scala", "la")
> res0: Long = 3
> But, when I register the above function as UDF (successfully registered) and 
> call the same without  passing frmIndex parameter got the below exception
> scala> val df  = 
> sqlContext.createDataFrame(Seq(("scala","Spark","MLlib"),("abc", "def", 
> "gfh"))).toDF("c1", "c2", "c3")
> df: org.apache.spark.sql.DataFrame = [c1: string, c2: string, c3: string]
> scala> df.show
> +-+-+-+
> |   c1|   c2|   c3|
> +-+-+-+
> |scala|Spark|MLlib|
> |  abc|  def|  gfh|
> +-+-+-+
> scala> sqlContext.udf.register("hasSubstring", hasSubstring _ )
> res3: org.apache.spark.sql.UserDefinedFunction = 
> UserDefinedFunction(,LongType,List())
> scala> val result = df.as("i0").withColumn("subStringIndex", 
> callUDF("hasSubstring", $"i0.c1", lit("la")))
> org.apache.spark.sql.AnalysisException: undefined function hasSubstring;
>   at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
>   at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57)
>   at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53)
>   at scala.util.Try.getOrElse(Try.scala:77)
>   at 
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$23.apply(Analyzer.scala:490)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$23.apply(Analyzer.scala:490)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Status of 2.11 support?

2015-11-11 Thread Jakob Odersky
Hi Sukant,

Regarding the first point: when building Spark during my daily work, I
always use Scala 2.11 and have only run into build problems once. Assuming
a working build, I have never had any issues with the resulting artifacts.

More generally however, I would advise you to go with Scala 2.11 under all
circumstances. Scala 2.10 has reached end-of-life and, from what I make out
of your question, you have the opportunity to switch to a newer technology,
so why stay with legacy? Furthermore, Scala 2.12 will be coming out early
next year, so I reckon that Spark will switch to Scala 2.11 by default
pretty soon*.

regards,
--Jakob

*I'm myself pretty new to the Spark community so please don't take my words
on it as gospel


On 11 November 2015 at 15:25, Ted Yu  wrote:

> For #1, the published jars are usable.
> However, you should build from source for your specific combination of
> profiles.
>
> Cheers
>
> On Wed, Nov 11, 2015 at 3:22 PM, shajra-cogscale <
> sha...@cognitivescale.com> wrote:
>
>> Hi,
>>
>> My company isn't using Spark in production yet, but we are using a bit of
>> Scala.  There's a few people who have wanted to be conservative and keep
>> our
>> Scala at 2.10 in the event we start using Spark.  There are others who
>> want
>> to move to 2.11 with the idea that by the time we're using Spark it will
>> be
>> more or less 2.11-ready.
>>
>> It's hard to make a strong judgement on these kinds of things without
>> getting some community feedback.
>>
>> Looking through the internet I saw:
>>
>> 1) There's advice to build 2.11 packages from source -- but also published
>> jars to Maven Central for 2.11.  Are these jars on Maven Central usable
>> and
>> the advice to build from source outdated?
>>
>> 2)  There's a note that the JDBC RDD isn't 2.11-compliant.  This is okay
>> for
>> us, but is there anything else to worry about?
>>
>> It would be nice to get some answers to those questions as well as any
>> other
>> feedback from maintainers or anyone that's used Spark with Scala 2.11
>> beyond
>> simple examples.
>>
>> Thanks,
>> Sukant
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-2-11-support-tp25362.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Slow stage?

2015-11-11 Thread Jakob Odersky
Hi Simone,
I'm afraid I don't have an answer to your question. However I noticed the
DAG figures in the attachment. How did you generate these? I am myself
working on a project in which I am trying to generate visual
representations of the spark scheduler DAG. If such a tool already exists,
I would greatly appreciate any pointers.

thanks,
--Jakob

On 9 November 2015 at 13:52, Simone Franzini  wrote:

> Hi all,
>
> I have a complex Spark job that is broken up in many stages.
> I have a couple of stages that are particularly slow: each task takes
> around 6 - 7 minutes. This stage is fairly complex as you can see from the
> attached DAG. However, by construction each of the outer joins will have
> only 0 or 1 record on each side.
> It seems to me that this stage is really slow. However, the execution
> timeline shows that almost 100% of the time is spent in actual execution
> time not reading/writing to/from disk or in other overheads.
> Does this make any sense? I.e. is it just that these operations are slow
> (and notice task size in term of data seems small)?
> Is the pattern of operations in the DAG good or is it terribly suboptimal?
> If so, how could it be improved?
>
>
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: Spark Packages Configuration Not Found

2015-11-11 Thread Jakob Odersky
As another, more general question: are Spark packages the go-to way of extending
Spark functionality? In my specific use case I would like to start Spark
(be it spark-shell or otherwise) and hook into the listener API.
Since I wasn't able to find much documentation about Spark packages, I was
wondering whether they are still actively being developed.

thanks,
--Jakob

On 10 November 2015 at 14:58, Jakob Odersky  wrote:

> (accidental keyboard-shortcut sent the message)
> ... spark-shell from the spark 1.5.2 binary distribution.
> Also, running "spPublishLocal" has the same effect.
>
> thanks,
> --Jakob
>
> On 10 November 2015 at 14:55, Jakob Odersky  wrote:
>
>> Hi,
>> I ran into in error trying to run spark-shell with an external package
>> that I built and published locally
>> using the spark-package sbt plugin (
>> https://github.com/databricks/sbt-spark-package).
>>
>> To my understanding, spark packages can be published simply as maven
>> artifacts, yet after running "publishLocal" in my package project (
>> https://github.com/jodersky/spark-paperui), the following command
>>
>>park-shell --packages
>> ch.jodersky:spark-paperui-server_2.10:0.1-SNAPSHOT
>>
>> gives an error:
>>
>> ::
>>
>> ::  UNRESOLVED DEPENDENCIES ::
>>
>> ::
>>
>> :: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
>> found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
>> required from org.apache.spark#spark-submit-parent;1.0 default
>>
>> ::
>>
>>
>> :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
>> Exception in thread "main" java.lang.RuntimeException: [unresolved
>> dependency: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
>> found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
>> required from org.apache.spark#spark-submit-parent;1.0 default]
>> at
>> org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1011)
>> at
>> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:12
>>
>> Do I need to include some default configuration? If so where and how
>> should I do it? All other packages I looked at had no such thing.
>>
>> Btw, I am using spark-shell from a
>>
>>
>


Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jakob Odersky
Hey Jeff,
Do you mean reading from multiple text files? In that case, as a
workaround, you can use the RDD#union() (or ++) method to concatenate
multiple rdds. For example:

val lines1 = sc.textFile("file1")
val lines2 = sc.textFile("file2")

val rdd = lines1 union lines2

regards,
--Jakob

On 11 November 2015 at 01:20, Jeff Zhang  wrote:

> Although user can use the hdfs glob syntax to support multiple inputs. But
> sometimes, it is not convenient to do that. Not sure why there's no api
> of SparkContext#textFiles. It should be easy to implement that. I'd love to
> create a ticket and contribute for that if there's no other consideration
> that I don't know.
>
> --
> Best Regards
>
> Jeff Zhang
>



Re: Spark Packages Configuration Not Found

2015-11-10 Thread Jakob Odersky
(accidental keyboard-shortcut sent the message)
... spark-shell from the spark 1.5.2 binary distribution.
Also, running "spPublishLocal" has the same effect.

thanks,
--Jakob

On 10 November 2015 at 14:55, Jakob Odersky  wrote:

> Hi,
> I ran into in error trying to run spark-shell with an external package
> that I built and published locally
> using the spark-package sbt plugin (
> https://github.com/databricks/sbt-spark-package).
>
> To my understanding, spark packages can be published simply as maven
> artifacts, yet after running "publishLocal" in my package project (
> https://github.com/jodersky/spark-paperui), the following command
>
>park-shell --packages ch.jodersky:spark-paperui-server_2.10:0.1-SNAPSHOT
>
> gives an error:
>
> ::
>
> ::  UNRESOLVED DEPENDENCIES ::
>
> ::
>
> :: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
> found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
> required from org.apache.spark#spark-submit-parent;1.0 default
>
> ::
>
>
> :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
> Exception in thread "main" java.lang.RuntimeException: [unresolved
> dependency: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
> found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
> required from org.apache.spark#spark-submit-parent;1.0 default]
> at
> org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1011)
> at
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:12
>
> Do I need to include some default configuration? If so where and how
> should I do it? All other packages I looked at had no such thing.
>
> Btw, I am using spark-shell from a
>
>


Spark Packages Configuration Not Found

2015-11-10 Thread Jakob Odersky
Hi,
I ran into an error trying to run spark-shell with an external package that
I built and published locally
using the spark-package sbt plugin (
https://github.com/databricks/sbt-spark-package).

To my understanding, spark packages can be published simply as maven
artifacts, yet after running "publishLocal" in my package project (
https://github.com/jodersky/spark-paperui), the following command

   spark-shell --packages ch.jodersky:spark-paperui-server_2.10:0.1-SNAPSHOT

gives an error:

::

::  UNRESOLVED DEPENDENCIES ::

::

:: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
required from org.apache.spark#spark-submit-parent;1.0 default

::


:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved
dependency: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
required from org.apache.spark#spark-submit-parent;1.0 default]
at
org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1011)
at
org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:12

Do I need to include some default configuration? If so where and how should
I do it? All other packages I looked at had no such thing.

Btw, I am using spark-shell from a


Re: State of the Build

2015-11-06 Thread Jakob Odersky
> Can you clarify which sbt jar (by path) ?
Any of them.
Sbt is a build tool, and I don't understand why it is included in a source
repository. It would be like including make in a project.

On 6 November 2015 at 16:43, Ted Yu  wrote:

> bq. include an sbt jar in the source repo
>
> Can you clarify which sbt jar (by path) ?
>
> I tried 'git log' on the following files but didn't see commit history:
>
> ./build/sbt-launch-0.13.7.jar
> ./build/zinc-0.3.5.3/lib/sbt-interface.jar
> ./sbt/sbt-launch-0.13.2.jar
> ./sbt/sbt-launch-0.13.5.jar
>
> On Fri, Nov 6, 2015 at 4:25 PM, Jakob Odersky  wrote:
>
>> [Reposting to the list again, I really should double-check that
>> reply-to-all button]
>>
>> in the mean-time, as a light Friday-afternoon patch I was thinking about
>> splitting the ~600loc-single-build sbt file into something more manageable
>> like the Akka build (without changing any dependencies or settings). I know
>> its pretty trivial and not very important, but it might make things easier
>> to add/refactor in the future.
>>
>> Also, why do we include an sbt jar in the source repo, especially if it
>> is used only as an internal development tool?
>>
>> On 6 November 2015 at 15:29, Patrick Wendell  wrote:
>>
>>> I think there are a few minor differences in the dependency graph that
>>> arise from this. For a given user, the probability it affects them is low -
>>> it needs to conflict with a library a user application is using. However
>>> the probability it affects *some users* is very high and we do see small
>>> changes crop up fairly frequently.
>>>
>>> My feeling is mostly pragmatic... if we can get things working to
>>> standardize on Maven-style resolution by upgrading SBT, let's do it. If
>>> that's not tenable, we can evaluate alternatives.
>>>
>>> - Patrick
>>>
>>> On Fri, Nov 6, 2015 at 3:07 PM, Marcelo Vanzin 
>>> wrote:
>>>
>>>> On Fri, Nov 6, 2015 at 3:04 PM, Koert Kuipers 
>>>> wrote:
>>>> > if i understand it correctly it would cause compatibility breaks for
>>>> > applications on top of spark, because those applications use the
>>>> exact same
>>>> > current resolution logic (so basically they are maven apps), and the
>>>> change
>>>> > would make them inconsistent?
>>>>
>>>> I think Patrick said it could cause compatibility breaks because
>>>> switching to sbt's version resolution means Spark's dependency tree
>>>> would change. Just to cite the recent example, you'd get Guava 16
>>>> instead of 14 (let's ignore that Guava is currently mostly shaded in
>>>> Spark), so if your app depended transitively on Guava and used APIs
>>>> from 14 that are not on 16, it would break.
>>>>
>>>> --
>>>> Marcelo
>>>>
>>>
>>>
>>
>


Re: State of the Build

2015-11-06 Thread Jakob Odersky
[Reposting to the list again, I really should double-check that
reply-to-all button]

In the meantime, as a light Friday-afternoon patch, I was thinking about
splitting the ~600-LOC single-file sbt build into something more manageable,
like the Akka build (without changing any dependencies or settings). I know
it's pretty trivial and not very important, but it might make things easier
to add/refactor in the future.

Also, why do we include an sbt jar in the source repo, especially if it is
used only as an internal development tool?

On 6 November 2015 at 15:29, Patrick Wendell  wrote:

> I think there are a few minor differences in the dependency graph that
> arise from this. For a given user, the probability it affects them is low -
> it needs to conflict with a library a user application is using. However
> the probability it affects *some users* is very high and we do see small
> changes crop up fairly frequently.
>
> My feeling is mostly pragmatic... if we can get things working to
> standardize on Maven-style resolution by upgrading SBT, let's do it. If
> that's not tenable, we can evaluate alternatives.
>
> - Patrick
>
> On Fri, Nov 6, 2015 at 3:07 PM, Marcelo Vanzin 
> wrote:
>
>> On Fri, Nov 6, 2015 at 3:04 PM, Koert Kuipers  wrote:
>> > if i understand it correctly it would cause compatibility breaks for
>> > applications on top of spark, because those applications use the exact
>> same
>> > current resolution logic (so basically they are maven apps), and the
>> change
>> > would make them inconsistent?
>>
>> I think Patrick said it could cause compatibility breaks because
>> switching to sbt's version resolution means Spark's dependency tree
>> would change. Just to cite the recent example, you'd get Guava 16
>> instead of 14 (let's ignore that Guava is currently mostly shaded in
>> Spark), so if your app depended transitively on Guava and used APIs
>> from 14 that are not on 16, it would break.
>>
>> --
>> Marcelo
>>
>
>


Re: State of the Build

2015-11-06 Thread Jakob Odersky
Reposting to the list...

Thanks for all the feedback everyone, I get a clearer picture of the
reasoning and implications now.

Koert, according to your post in this thread
http://apache-spark-developers-list.1001551.n3.nabble.com/Master-build-fails-tt14895.html#a15023,
it is apparently very easy to change the maven resolution mechanism to the
ivy one.
Patrick, would this not help with the problems you described?

On 5 November 2015 at 23:23, Patrick Wendell  wrote:

> Hey Jakob,
>
> The builds in Spark are largely maintained by me, Sean, and Michael
> Armbrust (for SBT). For historical reasons, Spark supports both a Maven and
> SBT build. Maven is the build of reference for packaging Spark and is used
> by many downstream packagers and to build all Spark releases. SBT is more
> often used by developers. Both builds inherit from the same pom files (and
> rely on the same profiles) to minimize maintenance complexity of Spark's
> very complex dependency graph.
>
> If you are looking to make contributions that help with the build, I am
> happy to point you towards some things that are consistent maintenance
> headaches. There are two major pain points right now that I'd be thrilled
> to see fixes for:
>
> 1. SBT relies on a different dependency conflict resolution strategy than
> maven - causing all kinds of headaches for us. I have heard that newer
> versions of SBT can (maybe?) use Maven as a dependency resolver instead of
> Ivy. This would make our life so much better if it were possible, either by
> virtue of upgrading SBT or somehow doing this ourselves.
>
> 2. We don't have a great way of auditing the net effect of dependency
> changes when people make them in the build. I am working on a fairly clunky
> patch to do this here:
>
> https://github.com/apache/spark/pull/8531
>
> It could be done much more nicely using SBT, but only provided (1) is
> solved.
>
> Doing a major overhaul of the sbt build to decouple it from pom files, I'm
> not sure that's the best place to start, given that we need to continue to
> support maven - the coupling is intentional. But getting involved in the
> build in general would be completely welcome.
>
> - Patrick
>
> On Thu, Nov 5, 2015 at 10:53 PM, Sean Owen  wrote:
>
>> Maven isn't 'legacy', or supported for the benefit of third parties.
>> SBT had some behaviors / problems that Maven didn't relative to what
>> Spark needs. SBT is a development-time alternative only, and partly
>> generated from the Maven build.
>>
>> On Fri, Nov 6, 2015 at 1:48 AM, Koert Kuipers  wrote:
>> > People who do upstream builds of spark (think bigtop and hadoop
>> distros) are
>> > used to legacy systems like maven, so maven is the default build. I
>> don't
>> > think it will change.
>> >
>> > Any improvements for the sbt build are of course welcome (it is still
>> used
>> > by many developers), but i would not do anything that increases the
>> burden
>> > of maintaining two build systems.
>> >
>> > On Nov 5, 2015 18:38, "Jakob Odersky"  wrote:
>> >>
>> >> Hi everyone,
>> >> in the process of learning Spark, I wanted to get an overview of the
>> >> interaction between all of its sub-projects. I therefore decided to
>> have a
>> >> look at the build setup and its dependency management.
>> >> Since I am alot more comfortable using sbt than maven, I decided to
>> try to
>> >> port the maven configuration to sbt (with the help of automated tools).
>> >> This led me to a couple of observations and questions on the build
>> system
>> >> design:
>> >>
>> >> First, currently, there are two build systems, maven and sbt. Is there
>> a
>> >> preferred tool (or future direction to one)?
>> >>
>> >> Second, the sbt build also uses maven "profiles" requiring the use of
>> >> specific commandline parameters when starting sbt. Furthermore, since
>> it
>> >> relies on maven poms, dependencies to the scala binary version (_2.xx)
>> are
>> >> hardcoded and require running an external script when switching
>> versions.
>> >> Sbt could leverage built-in constructs to support cross-compilation and
>> >> emulate profiles with configurations and new build targets. This would
>> >> remove external state from the build (in that no extra steps need to be
>> >> performed in a particular order to generate artifacts for a new
>> >> configuration) and therefore improve stability and build
>> reproducibility
>> >> (maybe even build performance). I was wondering if implementing such
>> >> functionality for the sbt build would be welcome?
>> >>
>> >> thanks,
>> >> --Jakob
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


State of the Build

2015-11-05 Thread Jakob Odersky
Hi everyone,
in the process of learning Spark, I wanted to get an overview of the
interaction between all of its sub-projects. I therefore decided to have a
look at the build setup and its dependency management.
Since I am a lot more comfortable using sbt than maven, I decided to try to
port the maven configuration to sbt (with the help of automated tools).
This led me to a couple of observations and questions on the build system
design:

First, currently, there are two build systems, maven and sbt. Is there a
preferred tool (or future direction to one)?

Second, the sbt build also uses maven "profiles" requiring the use of
specific commandline parameters when starting sbt. Furthermore, since it
relies on maven poms, dependencies to the scala binary version (_2.xx) are
hardcoded and require running an external script when switching versions.
Sbt could leverage built-in constructs to support cross-compilation and
emulate profiles with configurations and new build targets. This would
remove external state from the build (in that no extra steps need to be
performed in a particular order to generate artifacts for a new
configuration) and therefore improve stability and build reproducibility
(maybe even build performance). I was wondering if implementing such
functionality for the sbt build would be welcome?
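
To illustrate what I mean (a rough sketch, not Spark's actual build): sbt's
built-in cross-building can replace the external version-switch script, and %%
picks the matching _2.xx artifact suffix automatically.

// hypothetical fragment of a build.sbt
crossScalaVersions := Seq("2.10.5", "2.11.7")
scalaVersion := crossScalaVersions.value.head
libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.4" % "test"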

thanks,
--Jakob


Re: Turn off logs in spark-sql shell

2015-10-16 Thread Jakob Odersky
[repost to mailing list, ok I gotta really start hitting that
reply-to-all-button]

Hi,
Spark uses log4j, which unfortunately does not support fine-grained
configuration over the command line. Therefore some configuration file
editing will have to be done (unless you want to configure the loggers
programmatically, which however would require editing spark-sql).
Nevertheless, there seems to be a kind of "trick" where you can substitute
Java system properties in the log4j configuration file. See this
stackoverflow answer for details http://stackoverflow.com/a/31208461/917519.
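
For example, the relevant line in conf/log4j.properties could then look
something like this (a sketch based on that answer; the property name
my.logger.threshold is arbitrary):

log4j.rootCategory=${my.logger.threshold}, console
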
After editing the properties file, you can then start spark-sql with:

bin/spark-sql --conf
"spark.driver.extraJavaOptions=-Dmy.logger.threshold=OFF"

This is untested, but I hope it helps,
--Jakob

On 15 October 2015 at 22:56, Muhammad Ahsan 
wrote:

> Hello Everyone!
>
> I want to know how to turn off logging during starting *spark-sql shell*
> without changing log4j configuration files. In normal spark-shell I can use
> the following commands
>
> import org.apache.log4j.Loggerimport org.apache.log4j.Level
> Logger.getLogger("org").setLevel(Level.OFF)Logger.getLogger("akka").setLevel(Level.OFF)
>
>
> Thanks
>
> --
> Thanks
>
> Muhammad Ahsan
>
>


Re: Insight into Spark Packages

2015-10-16 Thread Jakob Odersky
[repost to mailing list]

I don't know much about packages, but have you heard about the
sbt-spark-package plugin?
Looking at the code, specifically
https://github.com/databricks/sbt-spark-package/blob/master/src/main/scala/sbtsparkpackage/SparkPackagePlugin.scala,
might give you insight into the details of package creation. Package
submission is implemented in
https://github.com/databricks/sbt-spark-package/blob/master/src/main/scala/sbtsparkpackage/SparkPackageHttp.scala

At a quick first overview, it seems packages are bundled as Maven artifacts
and then posted to http://spark-packages.org/api/submit-release.

Hope this helps for your last question

On 16 October 2015 at 08:43, jeff saremi  wrote:

> I'm looking for any form of documentation on Spark Packages
> Specifically, what happens when one issues a command like the following:
>
>
> $SPARK_HOME/bin/spark-shell --packages RedisLabs:spark-redis:0.1.0
>
>
> Something like an architecture diagram.
> What happens when this package gets submitted?
> Does this need to be done each time?
> Is that package downloaded each time?
> Is there a persistent cache on the server (master i guess)?
> Can these packages be installed offline with no Internet connectivity?
> How does a package get created?
>
> and so on and so forth
>


[jira] [Created] (SPARK-11122) Fatal warnings in sbt are not displayed as such

2015-10-14 Thread Jakob Odersky (JIRA)
Jakob Odersky created SPARK-11122:
-

 Summary: Fatal warnings in sbt are not displayed as such
 Key: SPARK-11122
 URL: https://issues.apache.org/jira/browse/SPARK-11122
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Jakob Odersky


The sbt script treats warnings (except dependency warnings) as errors; however,
there is no visual difference between errors and fatal warnings, which leads
to very confusing debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11110) Scala 2.11 build fails due to compiler errors

2015-10-14 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957721#comment-14957721
 ] 

Jakob Odersky commented on SPARK-11110:
---

Exactly what I got; I'll take a look at it.

> Scala 2.11 build fails due to compiler errors
> -
>
> Key: SPARK-11110
> URL: https://issues.apache.org/jira/browse/SPARK-11110
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Patrick Wendell
>
> Right now the 2.11 build is failing due to compiler errors in SBT (though not 
> in Maven). I have updated our 2.11 compile test harness to catch this.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1667/consoleFull
> {code}
> [error] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
>  no valid targets for annotation on value conf - it is discarded unused. You 
> may specify targets with meta-annotations, e.g. @(transient @param)
> [error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf)
> [error] 
> {code}
> This is one error, but there may be others past this point (the compile fails 
> fast).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Status of SBT Build

2015-10-14 Thread Jakob Odersky
Hi everyone,

I've been having trouble building Spark with SBT recently. Scala 2.11
doesn't work, and in all cases I get a large number of warnings and even
errors in tests.

I was therefore wondering what the official status of Spark with sbt is. Is
it very new and still buggy, or unmaintained and "falling to pieces"?

In any case, I would be glad to help with any issues on setting up a clean
and working build with sbt.

thanks,
--Jakob


Re: Building with SBT and Scala 2.11

2015-10-14 Thread Jakob Odersky
[Repost to mailing list]

Hey,
Sorry about the typo, I of course meant hadoop-2.6, not 2.11.
I suspect something bad happened with my Ivy cache, since when reverting
back to Scala 2.10, I got a very strange IllegalStateException (something
something IvyNode, I can't remember the details).
Killing the cache made 2.10 work at least; I'll retry with 2.11.

Thx for your help
On Oct 14, 2015 6:52 AM, "Ted Yu"  wrote:

> Adrian:
> Likely you were using maven.
>
> Jakob's report was with sbt.
>
> Cheers
>
> On Tue, Oct 13, 2015 at 10:05 PM, Adrian Tanase  wrote:
>
>> Do you mean hadoop-2.4 or 2.6? not sure if this is the issue but I'm also
>> compiling the 1.5.1 version with scala 2.11 and hadoop 2.6 and it works.
>>
>> -adrian
>>
>> Sent from my iPhone
>>
>> On 14 Oct 2015, at 03:53, Jakob Odersky  wrote:
>>
>> I'm having trouble compiling Spark with SBT for Scala 2.11. The command I
>> use is:
>>
>> dev/change-version-to-2.11.sh
>> build/sbt -Pyarn -Phadoop-2.11 -Dscala-2.11
>>
>> followed by
>>
>> compile
>>
>> in the sbt shell.
>>
>> The error I get specifically is:
>>
>> spark/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
>> no valid targets for annotation on value conf - it is discarded unused. You
>> may specify targets with meta-annotations, e.g. @(transient @param)
>> [error] private[netty] class NettyRpcEndpointRef(@transient conf:
>> SparkConf)
>> [error]
>>
>> However, I am also getting a large number of deprecation warnings, making
>> me wonder if I am supplying some incompatible/unsupported options to sbt. I
>> am using Java 1.8 and the latest Spark master sources.
>> Does someone know if I am doing anything wrong, or is the sbt build broken?
>>
>> thanks for your help,
>> --Jakob
>>
>>
>


[jira] [Created] (SPARK-11094) Test runner script fails to parse Java version.

2015-10-13 Thread Jakob Odersky (JIRA)
Jakob Odersky created SPARK-11094:
-

 Summary: Test runner script fails to parse Java version.
 Key: SPARK-11094
 URL: https://issues.apache.org/jira/browse/SPARK-11094
 Project: Spark
  Issue Type: Bug
  Components: Tests
 Environment: Debian testing
Reporter: Jakob Odersky
Priority: Minor


Running {{dev/run-tests}} fails when the local Java version string has an extra 
suffix appended.
For example, on Debian Stretch (currently the testing distribution), {{java 
-version}} yields "1.8.0_66-internal", and the extra "-internal" part causes 
the script to fail.
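
A minimal Scala sketch of the kind of lenient parsing that would tolerate such 
suffixes (purely illustrative; the actual test-runner script is not written in Scala):
{code}
// Purely illustrative: extract major/minor from strings such as
// "1.8.0_66" or "1.8.0_66-internal" by ignoring anything after the
// update number instead of failing on it.
val VersionPattern = """(\d+)\.(\d+)\.(\d+)(?:_\d+)?.*""".r

def majorMinor(raw: String): (Int, Int) = raw.trim match {
  case VersionPattern(major, minor, _) => (major.toInt, minor.toInt)
  case other => sys.error(s"Unparseable Java version: $other")
}

// majorMinor("1.8.0_66-internal") == (1, 8)
{code}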



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Building with SBT and Scala 2.11

2015-10-13 Thread Jakob Odersky
I'm having trouble compiling Spark with SBT for Scala 2.11. The command I
use is:

dev/change-version-to-2.11.sh
build/sbt -Pyarn -Phadoop-2.11 -Dscala-2.11

followed by

compile

in the sbt shell.

The error I get specifically is:

spark/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
no valid targets for annotation on value conf - it is discarded unused. You
may specify targets with meta-annotations, e.g. @(transient @param)
[error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf)
[error]

However I am also getting a large amount of deprecation warnings, making me
wonder if I am supplying some incompatible/unsupported options to sbt. I am
using Java 1.8 and the latest Spark master sources.
Does someone know if I am doing anything wrong or is the sbt build broken?

thanks for you help,
--Jakob


[jira] [Updated] (SPARK-11092) Add source URLs to API documentation.

2015-10-13 Thread Jakob Odersky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakob Odersky updated SPARK-11092:
--
Description: 
It would be nice to have source URLs in the Spark scaladoc, similar to the 
standard library (e.g. 
http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).

The fix should be really simple, just adding a line to the sbt unidoc settings.
I'll use the github repo url 
bq. https://github.com/apache/spark/tree/v${version}/${FILE_PATH}
Feel free to tell me if I should use something else as base url.
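
For illustration, a hedged sketch of what such a setting might look like in an sbt 
build (an assumption, not the actual SparkBuild change):
{code}
// Assumed sketch only: point scaladoc's -doc-source-url at GitHub so each
// generated page links back to its source file. Scaladoc replaces the
// €{FILE_PATH} token with the file's path relative to -sourcepath. With
// sbt-unidoc the same options would be scoped to the unidoc task instead.
scalacOptions in (Compile, doc) ++= Seq(
  "-doc-source-url",
  "https://github.com/apache/spark/tree/v" + version.value + "/€{FILE_PATH}.scala",
  "-sourcepath",
  (baseDirectory in ThisBuild).value.getAbsolutePath
)
{code}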

  was:
It would be nice to have source URLs in the Spark scaladoc, similar to the 
standard library (e.g. 
http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).

The fix should be really simple, just adding a line to the sbt unidoc settings.
I'll use the github repo url  
"https://github.com/apache/spark/tree/v${version}/${FILE_PATH}";). Feel free to 
tell me if I should use something else as base url.


> Add source URLs to API documentation.
> -
>
> Key: SPARK-11092
> URL: https://issues.apache.org/jira/browse/SPARK-11092
> Project: Spark
>  Issue Type: Documentation
>  Components: Build, Documentation
>Reporter: Jakob Odersky
>Priority: Trivial
>
> It would be nice to have source URLs in the Spark scaladoc, similar to the 
> standard library (e.g. 
> http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).
> The fix should be really simple, just adding a line to the sbt unidoc 
> settings.
> I'll use the github repo url 
> bq. https://github.com/apache/spark/tree/v${version}/${FILE_PATH}
> Feel free to tell me if I should use something else as base url.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11092) Add source URLs to API documentation.

2015-10-13 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955996#comment-14955996
 ] 

Jakob Odersky commented on SPARK-11092:
---

I can't set the assignee field, though I'd like to resolve this issue.

> Add source URLs to API documentation.
> -
>
> Key: SPARK-11092
> URL: https://issues.apache.org/jira/browse/SPARK-11092
> Project: Spark
>  Issue Type: Documentation
>  Components: Build, Documentation
>Reporter: Jakob Odersky
>Priority: Trivial
>
> It would be nice to have source URLs in the Spark scaladoc, similar to the 
> standard library (e.g. 
> http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).
> The fix should be really simple, just adding a line to the sbt unidoc 
> settings.
> I'll use the github repo url  
> "https://github.com/apache/spark/tree/v${version}/${FILE_PATH}";). Feel free 
> to tell me if I should use something else as base url.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11092) Add source URLs to API documentation.

2015-10-13 Thread Jakob Odersky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakob Odersky updated SPARK-11092:
--
Description: 
It would be nice to have source URLs in the Spark scaladoc, similar to the 
standard library (e.g. 
http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).

The fix should be really simple, just adding a line to the sbt unidoc settings.
I'll use the github repo url  
"https://github.com/apache/spark/tree/v${version}/${FILE_PATH}";). Feel free to 
tell me if I should use something else as base url.

  was:
It would be nice to have source URLs in the Spark scaladoc, similar to the 
standard library (e.g. 
http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).

The fix should be really simple, just adding a line to the sbt unidoc settings.
I'll use the github repo url  
"https://github.com/apache/spark/tree/v${version}/${FILE_PATH";). Feel free to 
tell me if I should use something else as base url.


> Add source URLs to API documentation.
> -
>
> Key: SPARK-11092
> URL: https://issues.apache.org/jira/browse/SPARK-11092
> Project: Spark
>  Issue Type: Documentation
>  Components: Build, Documentation
>Reporter: Jakob Odersky
>Priority: Trivial
>
> It would be nice to have source URLs in the Spark scaladoc, similar to the 
> standard library (e.g. 
> http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).
> The fix should be really simple, just adding a line to the sbt unidoc 
> settings.
> I'll use the github repo url  
> "https://github.com/apache/spark/tree/v${version}/${FILE_PATH}";). Feel free 
> to tell me if I should use something else as base url.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11092) Add source URLs to API documentation.

2015-10-13 Thread Jakob Odersky (JIRA)
Jakob Odersky created SPARK-11092:
-

 Summary: Add source URLs to API documentation.
 Key: SPARK-11092
 URL: https://issues.apache.org/jira/browse/SPARK-11092
 Project: Spark
  Issue Type: Documentation
  Components: Build, Documentation
Reporter: Jakob Odersky
Priority: Trivial


It would be nice to have source URLs in the Spark scaladoc, similar to the 
standard library (e.g. 
http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).

The fix should be really simple, just adding a line to the sbt unidoc settings.
I'll use the github repo url  
"https://github.com/apache/spark/tree/v${version}/${FILE_PATH";). Feel free to 
tell me if I should use something else as base url.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Spark Event Listener

2015-10-13 Thread Jakob Odersky
The path of the source file defining the event API is
`core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala`.

On 13 October 2015 at 16:29, Jakob Odersky  wrote:

> Hi,
> I came across the Spark listener API while checking out possible UI
> extensions recently. I noticed that all events inherit from a sealed trait
> `SparkListenerEvent` and that a SparkListener has a corresponding
> `onEventXXX(event)` method for every possible event.
>
> Considering that events inherit from a sealed trait and thus all events
> are known at compile time, what is the rationale for using specific
> methods for every event rather than a single method that would let a client
> pattern match on the type of event?
>
> I don't know the internals of the pattern matcher, but again, considering
> events are sealed, I reckon that matching performance should not be an
> issue.
>
> thanks,
> --Jakob
>


Spark Event Listener

2015-10-13 Thread Jakob Odersky
Hi,
I came across the Spark listener API while checking out possible UI
extensions recently. I noticed that all events inherit from a sealed trait
`SparkListenerEvent` and that a SparkListener has a corresponding
`onEventXXX(event)` method for every possible event.

Considering that events inherit from a sealed trait and thus all events are
known at compile time, what is the rationale for using specific methods
for every event rather than a single method that would let a client pattern
match on the type of event?

I don't know the internals of the pattern matcher, but again, considering
events are sealed, I reckon that matching performance should not be an
issue.
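
For concreteness, a rough sketch of the single-method alternative I have in mind
(toy trait and event names, not the actual SparkListener API):

sealed trait Event
case class JobStarted(jobId: Int) extends Event
case class JobEnded(jobId: Int, succeeded: Boolean) extends Event

// A single callback; clients pattern match on the sealed hierarchy, and the
// compiler can warn about unhandled event types.
trait SingleMethodListener {
  def onEvent(event: Event): Unit = event match {
    case JobStarted(id)   => println(s"job $id started")
    case JobEnded(id, ok) => println(s"job $id ended, success=$ok")
  }
}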

thanks,
--Jakob


Live UI

2015-10-12 Thread Jakob Odersky
Hi everyone,
I am just getting started working on Spark and was thinking of a first way
to contribute whilst still trying to wrap my head around the codebase.

Exploring the web UI, I noticed it is a classic request-response website,
requiring a manual refresh to get the latest data.
I think it would be great to have a "live" UI where data would be
displayed in real time without the need to hit the refresh button. I would be
very interested in contributing this feature if it is acceptable.

Specifically, I was thinking of using websockets with a ScalaJS front-end.
Please let me know if this design would be welcome or if it introduces
unwanted dependencies; I'll be happy to discuss this further in detail.

thanks for your feedback,
--Jakob


[jira] [Commented] (SPARK-10876) display total application time in spark history UI

2015-10-09 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951434#comment-14951434
 ] 

Jakob Odersky commented on SPARK-10876:
---

Do you mean to display the total run time of uncompleted apps? Completed apps 
already have a "Duration" field.

> display total application time in spark history UI
> --
>
> Key: SPARK-10876
> URL: https://issues.apache.org/jira/browse/SPARK-10876
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> The history file has an application start and application end events.  It 
> would be nice if we could use these to display the total run time for the 
> application in the history UI.
> Could be displayed similar to "Total Uptime" for a running application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10876) display total application time in spark history UI

2015-10-09 Thread Jakob Odersky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakob Odersky updated SPARK-10876:
--
Comment: was deleted

(was: I'm not sure what you mean. The UI already has a "Duration" field for 
every job.)

> display total application time in spark history UI
> --
>
> Key: SPARK-10876
> URL: https://issues.apache.org/jira/browse/SPARK-10876
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> The history file has an application start and application end events.  It 
> would be nice if we could use these to display the total run time for the 
> application in the history UI.
> Could be displayed similar to "Total Uptime" for a running application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10876) display total application time in spark history UI

2015-10-09 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951419#comment-14951419
 ] 

Jakob Odersky commented on SPARK-10876:
---

I'm not sure what you mean. The UI already has a "Duration" field for every job.

> display total application time in spark history UI
> --
>
> Key: SPARK-10876
> URL: https://issues.apache.org/jira/browse/SPARK-10876
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> The history file has an application start and application end events.  It 
> would be nice if we could use these to display the total run time for the 
> application in the history UI.
> Could be displayed similar to "Total Uptime" for a running application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10306) sbt hive/update issue

2015-10-09 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951362#comment-14951362
 ] 

Jakob Odersky commented on SPARK-10306:
---

Same issue here

> sbt hive/update issue
> -
>
> Key: SPARK-10306
> URL: https://issues.apache.org/jira/browse/SPARK-10306
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: holdenk
>Priority: Trivial
>
> Running sbt hive/update sometimes results in the error "impossible to get 
> artifacts when data has not been loaded. IvyNode = 
> org.scala-lang#scala-library;2.10.3" which is unfortunate since it is always 
> evicted by 2.10.4 currently. An easy (but maybe not super clean) solution 
> would be adding 2.10.3 as a dependency which will then get evicted.
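
A minimal sbt sketch of the workaround suggested in the description above 
(illustrative only, not a committed fix):
{code}
// Illustrative workaround only: declare the older scala-library explicitly so
// Ivy loads its node before the usual eviction by 2.10.4 takes place.
libraryDependencies += "org.scala-lang" % "scala-library" % "2.10.3"
{code}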



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


