Re: Third party library

2016-12-13 Thread Jakob Odersky
Hi Vineet,
great to see you solved the problem! Since this just appeared in my
inbox, I wanted to take the opportunity for a shameless plug:
https://github.com/jodersky/sbt-jni. In case you're using sbt and also
developing the native library, this plugin may help with the pains of
building and packaging JNI applications.

cheers,
--Jakob

On Tue, Dec 13, 2016 at 11:02 AM, vineet chadha  wrote:
> Thanks Steve and Kant. Apologies for the late reply; I was away on vacation.
> Got it working. For other users:
>
> def loadResources() {
>   System.loadLibrary("foolib")
>   val MyInstance = new MyClass
>   val retstr = MyInstance.foo("mystring") // the method being invoked
> }
>
> val conf = new SparkConf()
>   .setAppName("SimpleApplication")
>   .set("SPARK_LIBRARY_PATH", "/lib/location")
>
> val sc = new SparkContext(conf)
>
> sc.parallelize(1 to 10, 2).mapPartitions ( iter => {
>   MySimpleApp.loadResources()
>   iter
> }).count
>
>
>
> Regards,
> Vineet
>
> On Sun, Nov 27, 2016 at 2:15 PM, Steve Loughran 
> wrote:
>>
>>
>> On 27 Nov 2016, at 02:55, kant kodali  wrote:
>>
>> I would say instead of LD_LIBRARY_PATH you might want to use
>> java.library.path
>>
>> in the following way
>>
>> java -Djava.library.path=/path/to/my/library or pass java.library.path
>> along with spark-submit
>>
>>
>> This is only going to set up paths on the submitting system; to load JNI
>> code in the executors, the binary needs to be sent to the far end and then
>> put on the Java load path there.
>>
>> Copy the relevant binary to somewhere on the PATH of the destination
>> machine. Do that and you shouldn't have to worry about other JVM options,
>> (though it's been a few years since I did any JNI).
>>
>> One trick: write a simple main() object/entry point which calls the JNI
>> method and doesn't attempt to use any Spark libraries; have it log any
>> exception and return an error code if the call fails. This will let you use
>> it as a link test after deployment: if you can't run that class, things
>> are broken before you even go near Spark.
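A minimal sketch of that link test. The library name "foolib" and the MyClass/foo wrapper are placeholders taken from the snippets in this thread, not a real API:

    object JniLinkTest {
      def main(args: Array[String]): Unit = {
        try {
          System.loadLibrary("foolib")            // library name from this thread
          val result = new MyClass().foo("ping")  // MyClass/foo stand in for your own JNI wrapper
          println(s"JNI link test OK: $result")
        } catch {
          case t: Throwable =>
            t.printStackTrace()
            sys.exit(1)                           // non-zero exit code marks a broken deployment
        }
      }
    }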
>>
>>
>> On Sat, Nov 26, 2016 at 6:44 PM, Gmail  wrote:
>>>
>>> Maybe you've already checked these out. Some basic questions that come to
>>> my mind are:
>>> 1) is this library "foolib" or "foo-C-library" available on the worker
>>> node?
>>> 2) if yes, is it accessible by the user/program (rwx)?
>>>
>>> Thanks,
>>> Vasu.
>>>
>>> On Nov 26, 2016, at 5:08 PM, kant kodali  wrote:
>>>
>>> If it is working for a standalone program, I would think you can apply the
>>> same settings across all the Spark worker and client machines and give that
>>> a try. Let's start with that.
>>>
>>> On Sat, Nov 26, 2016 at 11:59 AM, vineet chadha 
>>> wrote:

 Just subscribed to Spark User, so forwarding the message again.

 On Sat, Nov 26, 2016 at 11:50 AM, vineet chadha 
 wrote:
>
> Thanks Kant. Can you give me a sample program that allows me to call
> JNI from an executor task? I have JNI working in a standalone program in
> Scala/Java.
>
> Regards,
> Vineet
>
> On Sat, Nov 26, 2016 at 11:43 AM, kant kodali 
> wrote:
>>
>> Yes this is a Java JNI question. Nothing to do with Spark really.
>>
>> A java.lang.UnsatisfiedLinkError typically means the way you set up
>> LD_LIBRARY_PATH is wrong, unless you tell us that it is working for other
>> cases but not this one.
>>
>> On Sat, Nov 26, 2016 at 11:23 AM, Reynold Xin 
>> wrote:
>>>
>>> That's just standard JNI and has nothing to do with Spark, does it?
>>>
>>>
>>> On Sat, Nov 26, 2016 at 11:19 AM, vineet chadha
>>>  wrote:

 Thanks Reynold for quick reply.

  I have tried following:

 class MySimpleApp {
   // --- Native methods
   @native def fooMethod(foo: String): String
 }

 object MySimpleApp {
   val flag = false
   def loadResources() {
     System.loadLibrary("foo-C-library")
     val flag = true
   }
   def main() {
     sc.parallelize(1 to 10).mapPartitions ( iter => {
       if (flag == false) {
         MySimpleApp.loadResources()
         val SimpleInstance = new MySimpleApp
       }
       SimpleInstance.fooMethod("fooString")
       iter
     })
   }
 }

 I don't see a way to invoke fooMethod, which is implemented in
 foo-C-library. Am I missing something? If possible, can you point me to
 an existing implementation which I can refer to.

 Thanks again.

 ~



Re: Optimization for Processing a million of HTML files

2016-12-12 Thread Jakob Odersky
Assuming the bottleneck is IO, you could try saving your files to
HDFS. This will distribute your data and allow for better concurrent
reads.
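As a rough sketch (the HDFS URL and partition count below are made up), once the files are on HDFS you can also give wholeTextFiles a minPartitions hint so the read is spread across more tasks:

    val pages = sc.wholeTextFiles("hdfs://namenode:8020/data/html", minPartitions = 64)
    println(pages.count()) // forces the read; each element is a (path, content) pair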

On Mon, Dec 12, 2016 at 3:06 PM, Reth RM  wrote:
> Hi,
>
> I have millions of HTML files in a directory and am using the "wholeTextFiles" API to
> load and process them. Right now, testing with 40k records, just loading the
> files (wholeTextFiles) takes a minimum of 8-9 minutes. What are some
> recommended optimizations? Should I consider any of Spark's file stream APIs
> instead of "wholeTextFiles"?
>
>
> Other info
> Running 1 master, 4 worker nodes 4 allocated.
> Added repartition jsc.wholeTextFiles(filesDirPath).repartition(4);
>
>




Re: wholeTextFiles()

2016-12-12 Thread Jakob Odersky
Also, in case the issue was not due to the string length (however it
is still valid and may get you later), the issue may be due to some
other indexing issues which are currently being worked on here
https://issues.apache.org/jira/browse/SPARK-6235

On Mon, Dec 12, 2016 at 8:18 PM, Jakob Odersky <ja...@odersky.com> wrote:
> Hi Pradeep,
>
> I'm afraid you're running into a hard Java issue. Strings are indexed
> with signed integers and can therefore not be longer than
> approximately 2 billion characters. Could you use `textFile` as a
> workaround? It will give you an RDD of the files' lines instead.
>
> In general, this guide http://spark.apache.org/contributing.html gives
> information on how to contribute to spark, including instructions on
> how to file bug reports (which does not apply in this case as it isn't
> a bug in Spark).
>
> regards,
> --Jakob
>
> On Mon, Dec 12, 2016 at 7:34 PM, Pradeep <pradeep.mi...@mail.com> wrote:
>> Hi,
>>
>> Why is there a restriction on the max file size that can be read by the
>> wholeTextFiles() method?
>>
>> I can read a 1.5 GB file but get an out-of-memory error for a 2 GB file.
>>
>> Also, how can I raise this as a defect in the Spark JIRA? Can someone please
>> guide me.
>>
>> Thanks,
>> Pradeep
>>
>>




Re: wholeTextFiles()

2016-12-12 Thread Jakob Odersky
Hi Pradeep,

I'm afraid you're running into a hard Java issue. Strings are indexed
with signed integers and can therefore not be longer than
approximately 2 billion characters. Could you use `textFile` as a
workaround? It will give you an RDD of the files' lines instead.
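A minimal sketch of that workaround (the path is hypothetical): textFile yields one record per line instead of one giant String per file, so no single String ever has to hold a whole 2 GB file.

    val lines = sc.textFile("hdfs:///data/big-input.txt")
    println(lines.count())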

In general, this guide http://spark.apache.org/contributing.html gives
information on how to contribute to spark, including instructions on
how to file bug reports (which does not apply in this case as it isn't
a bug in Spark).

regards,
--Jakob

On Mon, Dec 12, 2016 at 7:34 PM, Pradeep  wrote:
> Hi,
>
> Why is there a restriction on the max file size that can be read by the
> wholeTextFiles() method?
>
> I can read a 1.5 GB file but get an out-of-memory error for a 2 GB file.
>
> Also, how can I raise this as a defect in the Spark JIRA? Can someone please
> guide me.
>
> Thanks,
> Pradeep
>
>




Re: custom generate spark application id

2016-12-05 Thread Jakob Odersky
The app ID is assigned internally by Spark's task scheduler
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala#L35.
You could probably change the naming; however, I'm pretty sure that the
ID will always have to be unique for a context on a cluster.
Alternatively, could setting the name (conf.setAppName or via the
"spark.app.name" config) help with what you're trying to achieve?




Re: SparkILoop doesn't run

2016-11-21 Thread Jakob Odersky
The issue I was having had to do with missing classpath settings; in
sbt it can be solved by setting `fork:=true` to run tests in new jvms
with appropriate classpaths.
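For reference, a one-line sketch of that sbt fix (0.13-style syntax), placed in build.sbt:

    fork in Test := true // run tests in a separate JVM with a properly assembled classpath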

Mohit, from the looks of the error message, it also appears to be a
classpath issue. This typically happens when there are libraries for
multiple Scala versions on the same classpath. You mention that it
worked before; can you recall which libraries you upgraded before it
broke?

--Jakob

On Mon, Nov 21, 2016 at 2:34 PM, Jakob Odersky <ja...@odersky.com> wrote:
> Trying it out locally gave me an NPE. I'll look into it in more
> detail, however the SparkILoop.run() method is dead code. It's used
> nowhere in spark and can be removed without any issues.
>
> On Thu, Nov 17, 2016 at 11:16 AM, Mohit Jaggi <mohitja...@gmail.com> wrote:
>> Thanks Holden. I did post to the user list but since this is not a common
>> case, I am trying the developer list as well. Yes there is a reason: I get
>> code from somewhere e.g. a notebook. This type of code did work for me
>> before.
>>
>> Mohit Jaggi
>> Founder,
>> Data Orchard LLC
>> www.dataorchardllc.com
>>
>>
>>
>>
>> On Nov 17, 2016, at 8:53 AM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>
>> Moving to user list
>>
>> So this might be a better question for the user list - but is there a reason
>> you are trying to use the SparkILoop for tests?
>>
>> On Thu, Nov 17, 2016 at 5:47 PM Mohit Jaggi <mohitja...@gmail.com> wrote:
>>>
>>>
>>>
>>> I am trying to use SparkILoop to write some tests(shown below) but the
>>> test hangs with the following stack trace. Any idea what is going on?
>>>
>>>
>>> import org.apache.log4j.{Level, LogManager}
>>> import org.apache.spark.repl.SparkILoop
>>> import org.scalatest.{BeforeAndAfterAll, FunSuite}
>>>
>>> class SparkReplSpec extends FunSuite with BeforeAndAfterAll {
>>>
>>>   override def beforeAll(): Unit = {
>>>   }
>>>
>>>   override def afterAll(): Unit = {
>>>   }
>>>
>>>   test("yay!") {
>>> val rootLogger = LogManager.getRootLogger
>>> val logLevel = rootLogger.getLevel
>>> rootLogger.setLevel(Level.ERROR)
>>>
>>> val output = SparkILoop.run(
>>>   """
>>> |println("hello")
>>>   """.stripMargin)
>>>
>>> println(s" $output ")
>>>
>>>   }
>>> }
>>>
>>>
>>> /Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/bin/java
>>> -Dspark.master=local[*] -Didea.launcher.port=7532
>>> "-Didea.launcher.bin.path=/Applications/IntelliJ IDEA CE.app/Contents/bin"
>>> -Dfile.encoding=UTF-8 -classpath "/Users/mohit/Library/Application
>>> Support/IdeaIC2016.2/Scala/lib/scala-plugin-runners.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/deploy.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/ext/cldrdata.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/ext/dnsns.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/ext/jaccess.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/ext/jfxrt.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/ext/localedata.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/ext/nashorn.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/ext/sunec.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/ext/sunjce_provider.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/ext/sunpkcs11.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/ext/zipfs.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/javaws.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/jfxswt.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/management-agent.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/plugin.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/

Re: SparkILoop doesn't run

2016-11-21 Thread Jakob Odersky
Trying it out locally gave me an NPE. I'll look into it in more
detail, however the SparkILoop.run() method is dead code. It's used
nowhere in spark and can be removed without any issues.

On Thu, Nov 17, 2016 at 11:16 AM, Mohit Jaggi  wrote:
> Thanks Holden. I did post to the user list but since this is not a common
> case, I am trying the developer list as well. Yes there is a reason: I get
> code from somewhere e.g. a notebook. This type of code did work for me
> before.
>
> Mohit Jaggi
> Founder,
> Data Orchard LLC
> www.dataorchardllc.com
>
>
>
>
> On Nov 17, 2016, at 8:53 AM, Holden Karau  wrote:
>
> Moving to user list
>
> So this might be a better question for the user list - but is there a reason
> you are trying to use the SparkILoop for tests?
>
> On Thu, Nov 17, 2016 at 5:47 PM Mohit Jaggi  wrote:
>>
>>
>>
>> I am trying to use SparkILoop to write some tests(shown below) but the
>> test hangs with the following stack trace. Any idea what is going on?
>>
>>
>> import org.apache.log4j.{Level, LogManager}
>> import org.apache.spark.repl.SparkILoop
>> import org.scalatest.{BeforeAndAfterAll, FunSuite}
>>
>> class SparkReplSpec extends FunSuite with BeforeAndAfterAll {
>>
>>   override def beforeAll(): Unit = {
>>   }
>>
>>   override def afterAll(): Unit = {
>>   }
>>
>>   test("yay!") {
>> val rootLogger = LogManager.getRootLogger
>> val logLevel = rootLogger.getLevel
>> rootLogger.setLevel(Level.ERROR)
>>
>> val output = SparkILoop.run(
>>   """
>> |println("hello")
>>   """.stripMargin)
>>
>> println(s" $output ")
>>
>>   }
>> }
>>
>>
>> /Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/bin/java
>> -Dspark.master=local[*] -Didea.launcher.port=7532
>> "-Didea.launcher.bin.path=/Applications/IntelliJ IDEA CE.app/Contents/bin"
>> -Dfile.encoding=UTF-8 -classpath "/Users/mohit/Library/Application
>> 

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread Jakob Odersky
 > how do I tell my spark driver program to not create so many?

This may depend on your driver program. Do you spawn any threads in
it? Could you share some more information on the driver program, Spark
version and your environment? It would greatly help others to help you.
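As a diagnostic sketch (plain JVM code, no Spark APIs involved), you can group the live threads by name prefix to see which pools account for the growth:

    import scala.collection.JavaConverters._

    val countsByPrefix = Thread.getAllStackTraces.keySet.asScala
      .groupBy(_.getName.takeWhile(!_.isDigit)) // e.g. "ForkJoinPool-", "dag-scheduler-event-loop"
      .mapValues(_.size)
    countsByPrefix.toSeq.sortBy(-_._2).foreach { case (prefix, n) => println(f"$n%6d  $prefix") }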

On Mon, Oct 31, 2016 at 3:47 AM, kant kodali  wrote:
> The source of my problem is actually that I am running into the following
> error. This error seems to happen after running my driver program for 4
> hours.
>
> "Exception in thread "ForkJoinPool-50-worker-11" Exception in thread
> "dag-scheduler-event-loop" Exception in thread "ForkJoinPool-50-worker-13"
> java.lang.OutOfMemoryError: unable to create new native thread"
>
> and this wonderful book taught me that the error "unable to create new
> native thread" can happen because JVM is trying to request the OS for a
> thread and it is refusing to do so for the following reasons
>
> 1. The system has actually run out of virtual memory.
> 2. On Unix-style systems, the user has already created (between all programs
> user is running) the maximum number of processes configured for that user
> login. Individual threads are considered a process in that regard.
>
> Option #2 is ruled out in my case because my driver program is running
> with a userid of root, which has the maximum number of processes set to 120242.
>
> ulimit -a gives me the following
>
> core file size  (blocks, -c) 0
> data seg size   (kbytes, -d) unlimited
> scheduling priority (-e) 0
> file size   (blocks, -f) unlimited
> pending signals (-i) 120242
> max locked memory   (kbytes, -l) 64
> max memory size (kbytes, -m) unlimited
> open files  (-n) 1024
> pipe size(512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> real-time priority  (-r) 0
> stack size  (kbytes, -s) 8192
> cpu time   (seconds, -t) unlimited
> max user processes  (-u) 120242
> virtual memory  (kbytes, -v) unlimited
> file locks  (-x) unlimited
>
> So at this point I do understand that I am running out of memory due to the
> allocation of threads, so my biggest question is: how do I tell my Spark
> driver program not to create so many?
>
> On Mon, Oct 31, 2016 at 3:25 AM, Sean Owen  wrote:
>>
>> ps -L [pid] is what shows threads. I am not sure this is counting what you
>> think it does. My shell process has about a hundred threads, and I can't
>> imagine why one would have thousands unless your app spawned them.
>>
>> On Mon, Oct 31, 2016 at 10:20 AM kant kodali  wrote:
>>>
>>> when I do
>>>
>>> ps -elfT | grep "spark-driver-program.jar" | wc -l
>>>
>>> The result is around 32K. why does it create so many threads how can I
>>> limit this?
>
>




Re: [Spark 2] BigDecimal and 0

2016-10-24 Thread Jakob Odersky
Yes, thanks for elaborating Michael.
The other thing that I wanted to highlight was that in this specific
case the value is actually exactly zero (0E-18 = 0*10^(-18) = 0).
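A quick plain-Scala check (no Spark involved) that 0E-18 carries a different scale but exactly the same value as zero:

    val z = BigDecimal("0E-18")
    println(z.signum)                 // 0
    println(z.compare(BigDecimal(0))) // 0, i.e. numerically equal to zero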

On Mon, Oct 24, 2016 at 8:50 PM, Michael Matsko <m...@gwmail.gwu.edu> wrote:
> Efe,
>
> I think Jakob's point is that there is no problem.  When you deal with
> real numbers, you don't get exact representations of numbers.  There is
> always some slop in the representation; things don't ever cancel out exactly.
> Testing reals for equality to zero will almost never work.
>
> Look at Goldberg's paper
> https://ece.uwaterloo.ca/~dwharder/NumericalAnalysis/02Numerics/Double/paper.pdf
> for a quick intro.
>
> Mike
>
> On Oct 24, 2016, at 10:36 PM, Efe Selcuk <efema...@gmail.com> wrote:
>
> Okay, so this isn't contributing to any kind of imprecision. I suppose I
> need to go digging further then. Thanks for the quick help.
>
> On Mon, Oct 24, 2016 at 7:34 PM Jakob Odersky <ja...@odersky.com> wrote:
>>
>> What you're seeing is merely a strange representation: 0E-18 is zero.
>> The E-18 represents the precision that Spark uses to store the decimal.
>>
>> On Mon, Oct 24, 2016 at 7:32 PM, Jakob Odersky <ja...@odersky.com> wrote:
>> > An even smaller example that demonstrates the same behaviour:
>> >
>> > Seq(Data(BigDecimal(0))).toDS.head
>> >
>> > On Mon, Oct 24, 2016 at 7:03 PM, Efe Selcuk <efema...@gmail.com> wrote:
>> >> I’m trying to track down what seems to be a very slight imprecision in
>> >> our
>> >> Spark application; two of our columns, which should be netting out to
>> >> exactly zero, are coming up with very small fractions of non-zero
>> >> value. The
>> >> only thing that I’ve found out of place is that a case class entry into
>> >> a
>> >> Dataset we’ve generated with BigDecimal(“0”) will end up as 0E-18 after
>> >> it
>> >> goes through Spark, and I don’t know if there’s any appreciable
>> >> difference
>> >> between that and the actual 0 value, which can be generated with
>> >> BigDecimal.
>> >> Here’s a contrived example:
>> >>
>> >> scala> case class Data(num: BigDecimal)
>> >> defined class Data
>> >>
>> >> scala> val x = Data(0)
>> >> x: Data = Data(0)
>> >>
>> >> scala> x.num
>> >> res9: BigDecimal = 0
>> >>
>> >> scala> val y = Seq(x, x.copy()).toDS.reduce( (a,b) => a.copy(a.num +
>> >> b.num))
>> >> y: Data = Data(0E-18)
>> >>
>> >> scala> y.num
>> >> res12: BigDecimal = 0E-18
>> >>
>> >> scala> BigDecimal("1") - 1
>> >> res15: scala.math.BigDecimal = 0
>> >>
>> >> Am I looking at anything valuable?
>> >>
>> >> Efe




Re: [Spark 2] BigDecimal and 0

2016-10-24 Thread Jakob Odersky
What you're seeing is merely a strange representation: 0E-18 is zero.
The E-18 represents the precision that Spark uses to store the decimal.

On Mon, Oct 24, 2016 at 7:32 PM, Jakob Odersky <ja...@odersky.com> wrote:
> An even smaller example that demonstrates the same behaviour:
>
> Seq(Data(BigDecimal(0))).toDS.head
>
> On Mon, Oct 24, 2016 at 7:03 PM, Efe Selcuk <efema...@gmail.com> wrote:
>> I’m trying to track down what seems to be a very slight imprecision in our
>> Spark application; two of our columns, which should be netting out to
>> exactly zero, are coming up with very small fractions of non-zero value. The
>> only thing that I’ve found out of place is that a case class entry into a
>> Dataset we’ve generated with BigDecimal(“0”) will end up as 0E-18 after it
>> goes through Spark, and I don’t know if there’s any appreciable difference
>> between that and the actual 0 value, which can be generated with BigDecimal.
>> Here’s a contrived example:
>>
>> scala> case class Data(num: BigDecimal)
>> defined class Data
>>
>> scala> val x = Data(0)
>> x: Data = Data(0)
>>
>> scala> x.num
>> res9: BigDecimal = 0
>>
>> scala> val y = Seq(x, x.copy()).toDS.reduce( (a,b) => a.copy(a.num + b.num))
>> y: Data = Data(0E-18)
>>
>> scala> y.num
>> res12: BigDecimal = 0E-18
>>
>> scala> BigDecimal("1") - 1
>> res15: scala.math.BigDecimal = 0
>>
>> Am I looking at anything valuable?
>>
>> Efe




Re: [Spark 2] BigDecimal and 0

2016-10-24 Thread Jakob Odersky
An even smaller example that demonstrates the same behaviour:

Seq(Data(BigDecimal(0))).toDS.head

On Mon, Oct 24, 2016 at 7:03 PM, Efe Selcuk  wrote:
> I’m trying to track down what seems to be a very slight imprecision in our
> Spark application; two of our columns, which should be netting out to
> exactly zero, are coming up with very small fractions of non-zero value. The
> only thing that I’ve found out of place is that a case class entry into a
> Dataset we’ve generated with BigDecimal(“0”) will end up as 0E-18 after it
> goes through Spark, and I don’t know if there’s any appreciable difference
> between that and the actual 0 value, which can be generated with BigDecimal.
> Here’s a contrived example:
>
> scala> case class Data(num: BigDecimal)
> defined class Data
>
> scala> val x = Data(0)
> x: Data = Data(0)
>
> scala> x.num
> res9: BigDecimal = 0
>
> scala> val y = Seq(x, x.copy()).toDS.reduce( (a,b) => a.copy(a.num + b.num))
> y: Data = Data(0E-18)
>
> scala> y.num
> res12: BigDecimal = 0E-18
>
> scala> BigDecimal("1") - 1
> res15: scala.math.BigDecimal = 0
>
> Am I looking at anything valuable?
>
> Efe




Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-19 Thread Jakob Odersky
Another reason I could imagine is that files are often read from HDFS,
which by default uses line terminators to separate records.

It is possible to implement your own HDFS delimiter finder; however,
for arbitrary JSON data, finding that delimiter would require stateful
parsing of the file and would be difficult to parallelize across a
cluster.
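A workaround sketch for multi-line JSON (the path below is hypothetical): read each file whole and hand the complete documents to the JSON reader as an RDD of strings, so record boundaries no longer depend on newlines.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.getOrCreate()
    val wholeDocs = spark.sparkContext.wholeTextFiles("hdfs:///data/multiline-json").values
    val df = spark.read.json(wholeDocs) // each RDD element is parsed as one JSON record
    df.printSchema()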

On Tue, Oct 18, 2016 at 4:40 PM, Hyukjin Kwon  wrote:
> Regarding his recent PR [1], I guess he meant multi-line JSON.
>
> As far as I know, single-line JSON also complies with the standard. I left a
> comment with the RFC in the PR, but please let me know if I am wrong at any
> point.
>
> Thanks!
>
> [1]https://github.com/apache/spark/pull/15511
>
>
> On 19 Oct 2016 7:00 a.m., "Daniel Barclay" 
> wrote:
>>
>> Koert,
>>
>> Koert Kuipers wrote:
>>
>> A single JSON object would mean that, for most parsers, it needs to fit in memory
>> when reading or writing.
>>
>> Note that codlife didn't seem to be asking about single-object JSON
>> files, but about standard-format JSON files.
>>
>>
>> On Oct 15, 2016 11:09, "codlife" <1004910...@qq.com> wrote:
>>>
>>> Hi:
>>>    I have doubts about the design of spark.read.json: why is the JSON file
>>> not a standard JSON file? Who can tell me the internal reason? Any advice is
>>> appreciated.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Why-the-json-file-used-by-sparkSession-read-json-must-be-a-valid-json-object-per-line-tp27907.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>
>




Re: ClassCastException while running a simple wordCount

2016-10-10 Thread Jakob Odersky
Just thought of another potential issue: you should use the "provided"
scope when depending on spark. I.e in your project's pom:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.0.1</version>
  <scope>provided</scope>
</dependency>


On Mon, Oct 10, 2016 at 2:00 PM, Jakob Odersky <ja...@odersky.com> wrote:

> How do you submit the application? A version mismatch between the launcher,
> driver and workers could lead to the bug you're seeing. A common reason for
> a mismatch is if the SPARK_HOME environment variable is set. This will
> cause the spark-submit script to use the launcher determined by that
> environment variable, regardless of the directory from which it was called.
>
> On Mon, Oct 10, 2016 at 3:42 AM, kant kodali <kanth...@gmail.com> wrote:
>
>> +1 Wooho I have the same problem. I have been trying hard to fix this.
>>
>>
>>
>> On Mon, Oct 10, 2016 3:23 AM, vaibhav thapliyal
>> vaibhav.thapliyal...@gmail.com wrote:
>>
>>> Hi,
>>> If I change the parameter inside the setMaster()  to "local", the
>>> program runs. Is there something wrong with the cluster installation?
>>>
>>> I used the spark-2.0.1-bin-hadoop2.7.tgz package to install on my
>>> cluster with default configuration.
>>>
>>> Thanks
>>> Vaibhav
>>>
>>> On 10 Oct 2016 12:49, "vaibhav thapliyal" <vaibhav.thapliyal...@gmail.co
>>> m> wrote:
>>>
>>> Here is the code that I am using:
>>>
>>> public class SparkTest {
>>>
>>>
>>> public static void main(String[] args) {
>>>
>>> SparkConf conf = new SparkConf().setMaster("spark://
>>> 192.168.10.174:7077").setAppName("TestSpark");
>>> JavaSparkContext sc = new JavaSparkContext(conf);
>>>
>>> JavaRDD<String> textFile = sc.textFile("sampleFile.txt");
>>> JavaRDD<String> words = textFile.flatMap(new
>>> FlatMapFunction<String, String>() {
>>> public Iterator<String> call(String s) {
>>> return Arrays.asList(s.split(" ")).iterator();
>>> }
>>> });
>>> JavaPairRDD<String, Integer> pairs = words.mapToPair(new
>>> PairFunction<String, String, Integer>() {
>>> public Tuple2<String, Integer> call(String s) {
>>> return new Tuple2<String, Integer>(s, 1);
>>> }
>>> });
>>> JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new
>>> Function2<Integer, Integer, Integer>() {
>>> public Integer call(Integer a, Integer b) {
>>> return a + b;
>>> }
>>> });
>>> counts.saveAsTextFile("outputFile.txt");
>>>
>>> }
>>> }
>>>
>>> The content of the input file:
>>> Hello Spark
>>> Hi Spark
>>> Spark is running
>>>
>>>
>>> I am using the spark 2.0.1 dependency from maven.
>>>
>>> Thanks
>>> Vaibhav
>>>
>>> On 10 October 2016 at 12:37, Sudhanshu Janghel <
>>> sudhanshu.jang...@cloudwick.com> wrote:
>>>
>>> Seems like a straightforward error: it's trying to cast something as a
>>> list which is not a list or cannot be cast.  Are you using the standard
>>> example code? Can you send the input and code?
>>>
>>> On Oct 10, 2016 9:05 AM, "vaibhav thapliyal" <
>>> vaibhav.thapliyal...@gmail.com> wrote:
>>>
>>> Dear All,
>>>
>>> I am getting a ClassCastException Error when using the JAVA API to run
>>> the wordcount example from the docs.
>>>
>>> Here is the log that I got:
>>>
>>> 16/10/10 11:52:12 ERROR Executor: Exception in task 0.2 in stage 0.0 (TID 4)
>>> java.lang.ClassCastException: cannot assign instance of 
>>> scala.collection.immutable.List$SerializationProxy to field 
>>> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
>>> scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
>>> at 
>>> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
>>> at 
>>> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
>>> at 
>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
>>> at java.io.ObjectInputStre

Re: ClassCastException while running a simple wordCount

2016-10-10 Thread Jakob Odersky
How do you submit the application? A version mismatch between the launcher,
driver and workers could lead to the bug you're seeing. A common reason for
a mismatch is if the SPARK_HOME environment variable is set. This will
cause the spark-submit script to use the launcher determined by that
environment variable, regardless of the directory from which it was called.

On Mon, Oct 10, 2016 at 3:42 AM, kant kodali  wrote:

> +1 Wooho I have the same problem. I have been trying hard to fix this.
>
>
>
> On Mon, Oct 10, 2016 3:23 AM, vaibhav thapliyal
> vaibhav.thapliyal...@gmail.com wrote:
>
>> Hi,
>> If I change the parameter inside the setMaster()  to "local", the program
>> runs. Is there something wrong with the cluster installation?
>>
>> I used the spark-2.0.1-bin-hadoop2.7.tgz package to install on my cluster
>> with default configuration.
>>
>> Thanks
>> Vaibhav
>>
>> On 10 Oct 2016 12:49, "vaibhav thapliyal" 
>> wrote:
>>
>> Here is the code that I am using:
>>
>> public class SparkTest {
>>
>>
>> public static void main(String[] args) {
>>
>> SparkConf conf = new SparkConf().setMaster("spark://
>> 192.168.10.174:7077").setAppName("TestSpark");
>> JavaSparkContext sc = new JavaSparkContext(conf);
>>
>> JavaRDD<String> textFile = sc.textFile("sampleFile.txt");
>> JavaRDD<String> words = textFile.flatMap(new
>> FlatMapFunction<String, String>() {
>> public Iterator<String> call(String s) {
>> return Arrays.asList(s.split(" ")).iterator();
>> }
>> });
>> JavaPairRDD<String, Integer> pairs = words.mapToPair(new
>> PairFunction<String, String, Integer>() {
>> public Tuple2<String, Integer> call(String s) {
>> return new Tuple2<String, Integer>(s, 1);
>> }
>> });
>> JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new
>> Function2<Integer, Integer, Integer>() {
>> public Integer call(Integer a, Integer b) {
>> return a + b;
>> }
>> });
>> counts.saveAsTextFile("outputFile.txt");
>>
>> }
>> }
>>
>> The content of the input file:
>> Hello Spark
>> Hi Spark
>> Spark is running
>>
>>
>> I am using the spark 2.0.1 dependency from maven.
>>
>> Thanks
>> Vaibhav
>>
>> On 10 October 2016 at 12:37, Sudhanshu Janghel <
>> sudhanshu.jang...@cloudwick.com> wrote:
>>
>> Seems like a straightforward error: it's trying to cast something as a
>> list which is not a list or cannot be cast.  Are you using the standard
>> example code? Can you send the input and code?
>>
>> On Oct 10, 2016 9:05 AM, "vaibhav thapliyal" <
>> vaibhav.thapliyal...@gmail.com> wrote:
>>
>> Dear All,
>>
>> I am getting a ClassCastException Error when using the JAVA API to run
>> the wordcount example from the docs.
>>
>> Here is the log that I got:
>>
>> 16/10/10 11:52:12 ERROR Executor: Exception in task 0.2 in stage 0.0 (TID 4)
>> java.lang.ClassCastException: cannot assign instance of 
>> scala.collection.immutable.List$SerializationProxy to field 
>> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
>> scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
>>  at 
>> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
>>  at 
>> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
>>  at 
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
>>  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>>  at 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>  at 
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>>  at 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>  at 
>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
>>  at 
>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
>>  at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:71)
>>  at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>>  at org.apache.spark.scheduler.Task.run(Task.scala:86)
>>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>  at java.lang.Thread.run(Thread.java:745)
>> 16/10/10 11:52:12 ERROR Executor: 

Re: Package org.apache.spark.annotation no longer exist in Spark 2.0?

2016-10-04 Thread Jakob Odersky
It's still there on master. It is in the "spark-tags" module, however
(under common/tags); maybe something changed in the build environment
and it isn't made available as a dependency to your project? What
happens if you include the module as a direct dependency?
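If you build with sbt, a sketch of such a direct dependency looks like this (the 2.0.0 version number is an assumption; match it to your Spark version):

    libraryDependencies += "org.apache.spark" %% "spark-tags" % "2.0.0"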

--Jakob

On Tue, Oct 4, 2016 at 10:33 AM, Liren Ding  wrote:
> I just upgrade from Spark 1.6.1 to 2.0, and got  an java compile error:
> error: cannot access DeveloperApi
>   class file for org.apache.spark.annotation.DeveloperApi not found
>
> From the Spark 2.0 document
> (https://spark.apache.org/docs/2.0.0/api/java/overview-summary.html), the
> package org.apache.spark.annotation is removed. Does anyone know if it's
> moved to another package? Or how to call developerAPI with absence of the
> annotation? Thanks.
>
> Cheers,
> Liren
>
>




Re: get different results when debugging and running scala program

2016-09-30 Thread Jakob Odersky
There is no image attached, I'm not sure how the apache mailing lists
handle them. Can you provide the output as text?

best,
--Jakob

On Fri, Sep 30, 2016 at 8:25 AM, chen yong  wrote:
> Hello All,
>
>
>
> I am using IDEA 15.0.4 to debug a Scala program. It is strange to me that
> the results are different when I debug versus run the program. The differences
> can be seen in the attached files run.jpg and debug.jpg. The code lines of
> the Scala program are shown below.
>
>
> Thank you all
>
>
> ---
>
> import scala.collection.mutable.ArrayBuffer
>
> object TestCase1 {
>   def func(test: Iterator[(Int, Long)]): Iterator[(Int, Long)] = {
>     println("in")
>     val test1 = test.flatMap {
>       case (item, count) =>
>         val newPrefix = item
>         println(count)
>         val a = Iterator.single((newPrefix, count))
>         func(a)
>         val c = a
>         c
>     }
>     test1
>   }
>
>   def main(args: Array[String]) {
>     val freqItems = ArrayBuffer((2, 3L), (3, 2L), (4, 1L))
>     val test = freqItems.toIterator
>     val result = func(test)
>     val reer = result.toArray
>   }
> }
>
>
>




Re: Apache Spark JavaRDD pipe() need help

2016-09-22 Thread Jakob Odersky
Hi Shashikant,

I think you are trying to do too much at once in your helper class.
Spark's RDD API is functional; it is meant to be used by writing many
little transformations that will be distributed across a cluster.

Apart from that, `rdd.pipe` seems like a good approach. Here is the
relevant doc comment (in RDD.scala) on how to use it:

   * Return an RDD created by piping elements to a forked external process. The resulting RDD
   * is computed by executing the given process once per partition. All elements
   * of each input partition are written to a process's stdin as lines of input separated
   * by a newline. The resulting partition consists of the process's stdout output, with
   * each line of stdout resulting in one element of the output partition. A process is invoked
   * even for empty partitions.
   *
   * [...]

Check the full docs here:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@pipe(command:String):org.apache.spark.rdd.RDD[String]

This is how you could use it:

productRDD=//get from cassandra
processedRDD=productsRDD.map(STEP1).map(STEP2).pipe(C binary of step 3)
STEP4 //store processed RDD
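Expanding that pseudocode into a hedged Scala sketch (the Product fields, binary path and output path are assumptions based on this thread, and a SparkContext `sc` is assumed to be in scope):

    import org.apache.spark.rdd.RDD

    case class Product(id: String, name: String, sku: String, datetime: String)

    // normally loaded from Cassandra; a tiny in-memory sample keeps the sketch self-contained
    val productsRdd: RDD[Product] =
      sc.parallelize(Seq(Product("1", "widget", "SKU-1", "2016-09-22T10:00:00")))

    val results: RDD[String] = productsRdd
      .map(p => s"${p.name} ${p.id} ${p.sku} ${p.datetime}") // STEPS 1-2: one whitespace-separated line per record
      .pipe("/opt/bin/my-c-binary")                          // STEP 3: lines go to the binary's stdin, stdout lines come back
    results.saveAsTextFile("/output/processed")              // STEP 4: persist (or write back to Cassandra)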

hope this gives you some pointers,

best,
--Jakob




On Thu, Sep 22, 2016 at 2:10 AM, Shashikant Kulkarni (शशिकांत
कुलकर्णी) <shashikant.kulka...@gmail.com> wrote:
> Hello Jakob,
>
> Thanks for replying. Here is a short example of what I am trying. Taking an
> example of Product column family in Cassandra just for explaining my
> requirement
>
> In Driver.java
> {
>  JavaRDD productsRdd = Get Products from Cassandra;
>  productsRdd.map(ProductHelper.processProduct());
> }
>
> in ProductHelper.java
> {
>
> public static Function<Product, Boolean> processProduct() {
> return new Function< Product, Boolean>(){
> private static final long serialVersionUID = 1L;
>
> @Override
> public Boolean call(Product product) throws Exception {
> //STEP 1: Doing some processing on product object.
> //STEP 2: Now using few values of product, I need to create a string like
> "name id sku datetime"
> //STEP 3: Pass this string to my C binary file to perform some complex
> calculations and return some data
> //STEP 4: Get the return data and store it back in Cassandra DB
> }
> };
> }
> }
>
> In this ProductHelper, I cannot (and don't want to) pass the sparkContext
> object, as the app will throw a "task not serializable" error. If there is a way,
> let me know.
>
> Now I am not able to achieve STEP 3 above. How can I pass a String to the C
> binary and get the output back in my program? The C binary reads data from
> STDIN and outputs data to STDOUT. It is working from another part of the
> application, from PHP. I want to reuse the same C binary in my Apache Spark
> application for some background processing and analysis using the JavaRDD.pipe()
> API. If there is any other way, let me know. This code will be executed on
> all the nodes in the cluster.
>
> Hope my requirement is now clear. How to do this?
>
> Regards,
> Shash
>
> On Thu, Sep 22, 2016 at 4:13 AM, Jakob Odersky <ja...@odersky.com> wrote:
>>
>> Can you provide more details? It's unclear what you're asking
>>
>> On Wed, Sep 21, 2016 at 10:14 AM, shashikant.kulka...@gmail.com
>> <shashikant.kulka...@gmail.com> wrote:
>> > Hi All,
>> >
>> > I am trying to use the JavaRDD.pipe() API.
>> >
>> > I have one object with me from the JavaRDD
>
>




Re: Apache Spark JavaRDD pipe() need help

2016-09-21 Thread Jakob Odersky
Can you provide more details? It's unclear what you're asking

On Wed, Sep 21, 2016 at 10:14 AM, shashikant.kulka...@gmail.com
 wrote:
> Hi All,
>
> I am trying to use the JavaRDD.pipe() API.
>
> I have one object with me from the JavaRDD




Re: Has anyone installed the scala kernel for Jupyter notebook

2016-09-21 Thread Jakob Odersky
One option would be to use Apache Toree. A quick setup guide can be
found here https://toree.incubator.apache.org/documentation/user/quick-start

On Wed, Sep 21, 2016 at 2:02 PM, Arif,Mubaraka  wrote:
> Has anyone installed the scala kernel for Jupyter notebook.
>
>
>
> Any blogs or steps to follow in appreciated.
>
>
>
> thanks,
>
> Muby
>




Re: Task Deserialization Error

2016-09-21 Thread Jakob Odersky
Your app is fine; I think the error has to do with the way IntelliJ
launches applications. Is your app forked in a new JVM when you run it?

On Wed, Sep 21, 2016 at 2:28 PM, Gokula Krishnan D 
wrote:

> Hello Sumit -
>
> I can see that the SparkConf() specification is not mentioned in your
> program, but the rest looks good.
>
>
>
> Output:
>
>
> By the way, I have used the README.md template https://
> gist.github.com/jxson/1784669
>
> Thanks & Regards,
> Gokula Krishnan* (Gokul)*
>
> On Tue, Sep 20, 2016 at 2:15 AM, Chawla,Sumit 
> wrote:
>
>> Hi All
>>
>> I am trying to test a simple Spark APP using scala.
>>
>>
>> import org.apache.spark.SparkContext
>>
>> object SparkDemo {
>>   def main(args: Array[String]) {
>> val logFile = "README.md" // Should be some file on your system
>>
>> // to run in local mode
>> val sc = new SparkContext("local", "Simple App",
>>   "PATH_OF_DIRECTORY_WHERE_COMPILED_SPARK_PROJECT_FROM_GIT")
>>
>> val logData = sc.textFile(logFile).cache()
>> val numAs = logData.filter(line => line.contains("a")).count()
>> val numBs = logData.filter(line => line.contains("b")).count()
>>
>>
>> println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
>>
>>   }
>> }
>>
>>
>> When running this demo in IntelliJ, i am getting following error:
>>
>>
>> java.lang.IllegalStateException: unread block data
>>  at 
>> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2449)
>>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1385)
>>  at 
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>>  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>>  at 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
>>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>>  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
>>  at 
>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
>>  at 
>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
>>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:253)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>  at java.lang.Thread.run(Thread.java:745)
>>
>>
>> I guess its associated with task not being deserializable.  Any help will be 
>> appreciated.
>>
>>
>>
>> Regards
>> Sumit Chawla
>>
>>
>


Re: Can I assign affinity for spark executor processes?

2016-09-13 Thread Jakob Odersky
Hi Xiaoye,
could it be that the executors were spawned before the affinity was
set on the worker? Would it help to start spark worker with taskset
from the beginning, i.e. "taskset [mask] start-slave.sh"?
Workers in spark (standalone mode) simply create processes with the
standard java process API. Unless there is something funky going on in
the JRE, I don't see how spark could affect cpu affinity.

regards,
--Jakob

On Tue, Sep 13, 2016 at 7:56 PM, Xiaoye Sun  wrote:
> Hi,
>
> In my experiment, I pin one very important process on a fixed CPU. So the
> performance of Spark task execution will be affected if the executors or the
> worker uses that CPU. I am wondering if it is possible to let the Spark
> executors not using a particular CPU.
>
> I tried the 'taskset -p [cpumask] [pid]' command to set the affinity of the
> worker process. However, the executor processes created by the worker
> process don't inherit the same CPU affinity.
>
> Thanks!
>
> Best,
> Xiaoye




Re: iterating over DataFrame Partitions sequentially

2016-09-09 Thread Jakob Odersky
> Hi Jakob, I have a DataFrame with about 10 partitions. Based on the exact
> content of each partition I want to batch-load some other data from a DB. I
> cannot operate in parallel due to resource constraints I have, hence I want to
> sequentially iterate over each partition and perform operations.


Ah I see. I think in that case your best option is to run several
jobs, selecting different subsets of your dataframe for each job and
running them one after the other. One way to do that would be to get
the underlying RDD, map in the partition's index, and then
filter and iterate over every element. E.g.:

val withPartitionIndex = df.rdd.mapPartitionsWithIndex((idx, it) =>
  it.map(elem => (idx, elem)))

// n = number of partitions
for (i <- 0 until n) {
  withPartitionIndex.filter { case (idx, _) => idx == i }.foreach { case (idx, elem) =>
    // do something with elem
  }
}

it's not the best use-case of Spark though and will probably be a
performance bottleneck.

On Fri, Sep 9, 2016 at 11:45 AM, Jakob Odersky <ja...@odersky.com> wrote:
> Hi Sujeet,
>
> going sequentially over all parallel, distributed data seems like a
> counter-productive thing to do. What are you trying to accomplish?
>
> regards,
> --Jakob
>
> On Fri, Sep 9, 2016 at 3:29 AM, sujeet jog <sujeet@gmail.com> wrote:
>> Hi,
>> Is there a way to iterate over a DataFrame with n partitions sequentially,
>>
>>
>> Thanks,
>> Sujeet
>>




Re: iterating over DataFrame Partitions sequentially

2016-09-09 Thread Jakob Odersky
Hi Sujeet,

going sequentially over all parallel, distributed data seems like a
counter-productive thing to do. What are you trying to accomplish?

regards,
--Jakob

On Fri, Sep 9, 2016 at 3:29 AM, sujeet jog  wrote:
> Hi,
> Is there a way to iterate over a DataFrame with n partitions sequentially,
>
>
> Thanks,
> Sujeet
>




Re: Returning DataFrame as Scala method return type

2016-09-08 Thread Jakob Odersky
(Maybe unrelated FYI): in case you're using only Scala or Java with
Spark, I would recommend to use Datasets instead of DataFrames. They
provide exactly the same functionality, yet offer more type-safety.
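A minimal sketch of the difference (the file name and schema are hypothetical):

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    val people = spark.read.json("people.json").as[Person] // Dataset[Person] instead of a DataFrame of Rows
    people.filter(_.age > 21).show()                        // field access is checked at compile time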

On Thu, Sep 8, 2016 at 11:05 AM, Lee Becker  wrote:
>
> On Thu, Sep 8, 2016 at 11:35 AM, Ashish Tadose 
> wrote:
>>
>> I wish to organize these dataframe operations by grouping them into Scala
>> object methods.
>> Something like below:
>>
>>
>>> object Driver {
>>> def main(args: Array[String]) {
>>>   val df = Operations.process(sparkContext)
>>>   }
>>> }
>>>
>>> object Operations {
>>>   def process(sparkContext: SparkContext) : DataFrame = {
>>> //series of dataframe operations
>>>   }
>>> }
>>
>>
>> My stupid question is: is retrieving a DF from another Scala object's method
>> as a return type the right thing to do at large scale?
>> Would returning a DF to the driver cause all data to get passed to the driver
>> code, or would it return just a pointer to the DF?
>
>
> As long as the methods do not trigger any executions, it is fine to pass a
> DataFrame back to the driver.  Think of a DataFrame as an abstraction over
> RDDs.  When you return an RDD or DataFrame you're not returning the object
> itself.  Instead you're returning a recipe that details the series of
> operations needed to produce the data.
>
>
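To illustrate the point above, a hedged sketch (object and column names are made up): building a DataFrame inside a helper only records a query plan, and nothing executes until the driver calls an action.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    object Operations {
      def process(spark: SparkSession): DataFrame =
        spark.range(1000).toDF("id").filter("id % 2 = 0") // builds a plan, runs nothing
    }

    val spark = SparkSession.builder.getOrCreate()
    val df = Operations.process(spark) // still just a query plan on the driver
    println(df.count())                // the action is what actually executes on the cluster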




Re: Dataset encoder for java.time.LocalDate?

2016-09-02 Thread Jakob Odersky
Spark currently requires at least Java 1.7, so adding a Java
1.8-specific encoder will not be straightforward without affecting
requirements. I can think of two solutions:

1. add a Java 1.8 build profile which includes such encoders (this may
be useful for Scala 2.12 support in the future as well)
2. expose a custom Encoder API (the current one is not easily extensible)

I would personally favor solution number 2 as it avoids adding yet
another build configuration to choose from, however I am not sure how
feasible it is to make custom encoders play nice with Catalyst.

To get back to your question, I don't think there are currently any
plans and I would recommend you work around the issue by converting to
the old Date API
http://stackoverflow.com/questions/33066904/localdate-to-java-util-date-and-vice-versa-simpliest-conversion
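A sketch of that workaround (class and field names are made up): keep java.sql.Date in the case class that Spark encodes, and convert at the boundary with the standard JDK 8 methods valueOf/toLocalDate.

    import java.sql.Date
    import java.time.LocalDate

    case class Event(name: String, when: Date) // Spark has an encoder for java.sql.Date

    def toEvent(name: String, d: LocalDate): Event = Event(name, Date.valueOf(d))
    def eventDate(e: Event): LocalDate = e.when.toLocalDate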

On Fri, Sep 2, 2016 at 8:29 AM, Daniel Siegmann
 wrote:
> It seems Spark can handle case classes with java.sql.Date, but not
> java.time.LocalDate. It complains there's no encoder.
>
> Are there any plans to add an encoder for LocalDate (and other classes in
> the new Java 8 Time and Date API), or is there an existing library I can use
> that provides encoders?
>
> --
> Daniel Siegmann
> Senior Software Engineer
> SecurityScorecard Inc.
> 214 W 29th Street, 5th Floor
> New York, NY 10001
>




Re: Scala Vs Python

2016-09-02 Thread Jakob Odersky
Forgot to answer your question about feature parity of Python w.r.t.
Spark's different components.
I mostly work with Scala so I can't say for sure, but I think that all
pre-2.0 features (that's basically everything except Structured Streaming)
are on par. Structured Streaming is a pretty new feature and Python support
is currently not available. The API is not final, however, and I reckon that
Python support will arrive once it gets finalized, probably in the next
version.


Re: Scala Vs Python

2016-09-02 Thread Jakob Odersky
As you point out, often the reason that Python support lags behind is that
functionality is implemented in Scala, so the API in that language is
"free" whereas Python support needs to be added explicitly. Nevertheless,
Python bindings are an important part of Spark and is used by many people
(this info could be outdated but Python used to be the second most popular
language after Scala). I expect Python support to only get better in the
future so I think it is fair to say that Python is a first-class citizen in
Spark.

Regarding performance, the issue is more complicated. This is mostly due to
the fact that the actual execution of actions happens in JVM-land and any
correspondence between Python and the JVM is expensive. So the question
basically boils down to "how often does Python need to communicate with the
JVM?" The answer depends on the Spark APIs you're using:

1. Plain old RDDs: for every function you pass to a transformation (filter,
map, etc) an intermediate result will be shipped to a Pyhon interpreter,
the function applied, and finally the result shipped back to the JVM.
2. DataFrames with RDD-like transformations or User Defined Functions: same
as point 1, any functions are applied in a Python environment and hence
data needs to be transferred.
3. DataFrames with only SQL expressions: Spark query optimizer will take
care of computing and executing an internal representation of your
transformations and no data communication needs to happen between Python
and the JVM (apart from final results in case you asked for them, i.e. by
calling a collect()).

In cases 1 and 2, there will be a lack in performance compared to
equivalent Scala or Java versions. The difference in case 3 is negligible
as all language APIs will share the same backend .See this blog post from
Databricks for some more detailed information:
https://databricks.com/blog/2015/04/24/recent-performance-improvements-in-apache-spark-sql-python-dataframes-and-more.html

I hope this was the kind of information you were looking for. Please note
however that performance in Spark is a complex topic, the scenarios I
mentioned above should nevertheless give you some rule of thumb.

best,
--Jakob

On Thu, Sep 1, 2016 at 11:25 PM, ayan guha  wrote:

> Tal: I think by the nature of the project itself, Python APIs are developed
> after Scala and Java, and it is a fair trade-off for speed of getting
> stuff to market. And the more this discussion progresses, the less of an issue
> I see in terms of feature parity.
>
> Coming back to performance, Darren raised a good point: if I can scale
> out, individual VM performance should not matter much. But performance is
> often stated as a definitive downside of using Python over scala/java. I am
> trying to understand the truth and myth behind this claim. Any pointer
> would be great.
>
> best
> Ayan
>
> On Fri, Sep 2, 2016 at 4:10 PM, Tal Grynbaum 
> wrote:
>
>>
>> On Fri, Sep 2, 2016 at 1:15 AM, darren  wrote:
>>
>>> This topic is a concern for us as well. In the data science world no one
>>> uses native scala or java by choice. It's R and Python. And python is
>>> growing. Yet in spark, python is 3rd in line for feature support, if at all.
>>>
>>> This is why we have decoupled from Spark in our project. It's really
>>> unfortunate the Spark team has invested so heavily in Scala.
>>>
>>> As for speed it comes from horizontal scaling and throughout. When you
>>> can scale outward, individual VM performance is less an issue. Basic HPC
>>> principles.
>>>
>>
>> Darren,
>>
>> My guess is that data scientists who decouple themselves from Spark
>> will eventually be left with more or less nothing (single-process
>> capabilities, or purely performing HPC), unless, unlikely, some good
>> Spark competitor emerges; unlikely, simply because there is no need
>> for one.
>> But putting guessing aside: the reason Python is 3rd in line for feature
>> support is not because the Spark developers were busy with Scala; it's
>> because the features that are missing are those that support strong typing,
>> which is not relevant to Python. In other words, even if Spark were
>> rewritten in Python and focused on Python only, you would still not
>> get those features.
>>
>>
>>
>> --
>> *Tal Grynbaum* / *CTO & co-founder*
>>
>> m# +972-54-7875797
>>
>> mobile retention done right
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Possible Code Generation Bug: Can Spark 2.0 Datasets handle Scala Value Classes?

2016-09-01 Thread Jakob Odersky
I'm not sure how the shepherd thing works, but just FYI Michael
Armbrust originally wrote Catalyst, the engine behind Datasets.

You can find a list of all committers here
https://cwiki.apache.org/confluence/display/SPARK/Committers. Another
good resource is to check https://spark-prs.appspot.com/ and filter
issues by components. People frequently involved in comments on a
specific component will likely be quite knowledgeable in that area.




Re: Scala Vs Python

2016-09-01 Thread Jakob Odersky
Hi Mich,

the functional difference between Datasets and DataFrames is virtually
non-existant in Spark 2.0. Historically, DataFrames were the first
implementation of a collection to use Catalyst, Spark SQL's query
optimizer. Whilst bringing lots of performance benefits, DataFrames came at
the expense of type safety since they are essentially a collection of "Row"
objects regardless of the actual data they represented. Datasets were added
in Spark 1.6 to add back type-safety whilst still taking advantage of
Catalyst. In Spark 2.0 DataFrame became an alias for Dataset.

Both Datasets and DataFrames are entry points to Catalyst that will run
queries (aka transformations and actions) "on top of" RDDs. You can think
of it this way: when applying an action on a Dataset, Catalyst basically
will try to figure out what sequence of RDD transformations correspond to
the query and are the most efficient (in practice it is slightly more
complex). Your statement that "Dataset is [...] basically an RDD with some
optimization gone into it" is true in that regard :)

best,
--Jakob

On Thu, Sep 1, 2016 at 3:15 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> Thanks I have already seen that link.
>
> We were discussing this topic on another thread today.
>
> "Difference between Data set and Data Frame in Spark 2
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 1 September 2016 at 23:10, Peyman Mohajerian <mohaj...@gmail.com>
> wrote:
>
>> https://databricks.com/blog/2016/07/14/a-tale-of-three-apach
>> e-spark-apis-rdds-dataframes-and-datasets.html
>>
>> On Thu, Sep 1, 2016 at 3:01 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Jacob.
>>>
>>> My understanding of Dataset is that it is basically an RDD with some
>>> optimization gone into it. RDD is meant to deal with unstructured data?
>>>
>>> Now DataFrame is the tabular format of RDD designed for tabular work,
>>> csv, SQL stuff etc.
>>>
>>> When you mention DataFrame is just an alias for Dataset[Row] does that
>>> mean  that it converts an RDD to DataSet thus producing a tabular format?
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 1 September 2016 at 22:49, Jakob Odersky <ja...@odersky.com> wrote:
>>>
>>>> > However, what really worries me is not having Dataset APIs at all in
>>>> Python. I think thats a deal breaker.
>>>>
>>>> What is the functionality you are missing? In Spark 2.0 a DataFrame is
>>>> just an alias for Dataset[Row] ("type DataFrame = Dataset[Row]" in
>>>> core/.../o/a/s/sql/package.scala).
>>>> Since python is dynamically typed, you wouldn't really gain anything by
>>>> using Datasets anyway.
>>>>
>>>> On Thu, Sep 1, 2016 at 2:20 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> Thanks All for your replies.
>>>>>
>>>>> Feature Parity:
>>>>>
>>>>> MLLib, RDD and dataframes features are totally comparable. Streaming
>>>>> is now at par in functionality too, I believe. However, what really 
>>>>> worries
>>>>> me is not having Dataset APIs at a

Re: Scala Vs Python

2016-09-01 Thread Jakob Odersky
> However, what really worries me is not having Dataset APIs at all in
Python. I think thats a deal breaker.

What is the functionality you are missing? In Spark 2.0 a DataFrame is just
an alias for Dataset[Row] ("type DataFrame = Dataset[Row]" in
core/.../o/a/s/sql/package.scala).
Since Python is dynamically typed, you wouldn't really gain anything by
using Datasets anyway.

On Thu, Sep 1, 2016 at 2:20 PM, ayan guha  wrote:

> Thanks All for your replies.
>
> Feature Parity:
>
> MLLib, RDD and dataframes features are totally comparable. Streaming is
> now at par in functionality too, I believe. However, what really worries me
> is not having Dataset APIs at all in Python. I think thats a deal breaker.
>
> Performance:
> I do  get this bit when RDDs are involved, but not when Data frame is the
> only construct I am operating on.  Dataframe supposed to be
> language-agnostic in terms of performance.  So why people think python is
> slower? is it because of using UDF? Any other reason?
>
> *Is there any kind of benchmarking/stats around Python UDF vs Scala UDF
> comparison? like the one out there  b/w RDDs.*
>
> @Kant:  I am not comparing ANY applications. I am comparing SPARK
> applications only. I would be glad to hear your opinion on why pyspark
> applications will not work, if you have any benchmarks please share if
> possible.
>
>
>
>
>
> On Fri, Sep 2, 2016 at 12:57 AM, kant kodali  wrote:
>
>> c'mon man this is no Brainer..Dynamic Typed Languages for Large Code
>> Bases or Large Scale Distributed Systems makes absolutely no sense. I can
>> write a 10 page essay on why that wouldn't work so great. you might be
>> wondering why would spark have it then? well probably because its ease of
>> use for ML (that would be my best guess).
>>
>>
>>
>> On Wed, Aug 31, 2016 11:45 PM, AssafMendelson assaf.mendel...@rsa.com
>> wrote:
>>
>>> I believe this would greatly depend on your use case and your
>>> familiarity with the languages.
>>>
>>>
>>>
>>> In general, scala would have a much better performance than python and
>>> not all interfaces are available in python.
>>>
>>> That said, if you are planning to use dataframes without any UDF then
>>> the performance hit is practically nonexistent.
>>>
>>> Even if you need UDF, it is possible to write those in scala and wrap
>>> them for python and still get away without the performance hit.
>>>
>>> Python does not have interfaces for UDAFs.
>>>
>>>
>>>
>>> I believe that if you have large structured data and do not generally
>>> need UDF/UDAF you can certainly work in python without losing too much.
>>>
>>>
>>>
>>>
>>>
>>> *From:* ayan guha [mailto:[hidden email]
>>> ]
>>> *Sent:* Thursday, September 01, 2016 5:03 AM
>>> *To:* user
>>> *Subject:* Scala Vs Python
>>>
>>>
>>>
>>> Hi Users
>>>
>>>
>>>
>>> Thought to ask (again and again) the question: While I am building any
>>> production application, should I use Scala or Python?
>>>
>>>
>>>
>>> I have read many if not most articles but all seems pre-Spark 2.
>>> Anything changed with Spark 2? Either pro-scala way or pro-python way?
>>>
>>>
>>>
>>> I am thinking performance, feature parity and future direction, not so
>>> much in terms of skillset or ease of use.
>>>
>>>
>>>
>>> Or, if you think it is a moot point, please say so as well.
>>>
>>>
>>>
>>> Any real life example, production experience, anecdotes, personal taste,
>>> profanity all are welcome :)
>>>
>>>
>>>
>>> --
>>>
>>> Best Regards,
>>> Ayan Guha
>>>
>>> --
>>> View this message in context: RE: Scala Vs Python
>>> 
>>> Sent from the Apache Spark User List mailing list archive
>>>  at Nabble.com.
>>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Possible Code Generation Bug: Can Spark 2.0 Datasets handle Scala Value Classes?

2016-09-01 Thread Jakob Odersky
Hi Aris,
thanks for sharing this issue. I can confirm that value classes
currently don't work; however, I can't think of a reason why they
shouldn't be supported. I would therefore recommend that you report
this as a bug.

(Btw, value classes also currently aren't definable in the REPL. See
https://issues.apache.org/jira/browse/SPARK-17367)
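A possible stopgap in the meantime (an untested sketch, assuming a
SparkSession named spark): keep the value class at the edges of your
program and put the underlying primitive into the Dataset, e.g.

case class FeatureId(value: Int) extends AnyVal

import spark.implicits._
val ids = Seq(FeatureId(1), FeatureId(2), FeatureId(3))
val ds = spark.createDataset(ids.map(_.value)) // Dataset[Int] works fine
val back = ds.collect().map(FeatureId(_))      // re-wrap outside of Spark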

regards,
--Jakob

On Thu, Sep 1, 2016 at 1:58 PM, Aris  wrote:
> Hello Spark community -
>
> Does Spark 2.0 Datasets *not support* Scala Value classes (basically
> "extends AnyVal" with a bunch of limitations) ?
>
> I am trying to do something like this:
>
> case class FeatureId(value: Int) extends AnyVal
> val seq = Seq(FeatureId(1),FeatureId(2),FeatureId(3))
> import spark.implicits._
> val ds = spark.createDataset(seq)
> ds.count
>
>
> This will compile, but then it will break at runtime with a cryptic error
> about "cannot find int at value". If I remove the "extends AnyVal" part,
> then everything works.
>
> Value classes are a great performance boost / static type checking feature
> in Scala, but are they prohibited in Spark Datasets?
>
> Thanks!
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to use custom class in DataSet

2016-08-30 Thread Jakob Odersky
Implementing custom encoders is unfortunately not well supported at
the moment (IIRC there are plans to eventually add an API for
user-defined encoders).

That being said, there are a couple of encoders that can work with
generic, serializable data types: "javaSerialization" and "kryo",
found here 
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Encoders$.
These encoders need to be specified explicitly, as in
"spark.createDataset(...)(Encoders.javaSerialization)"

In Spark 2.1 there will also be a special trait
"org.apache.spark.sql.catalyst.DefinedByConstructorParams" that can be
mixed into arbitrary classes and that has implicit encoders available.

If you don't control the source of the class in question and it is not
serializable, it may still be possible to define your own Encoder by
implementing your own "o.a.s.sql.catalyst.encoders.ExpressionEncoder".
However, that requires quite some knowledge of how Spark's SQL
optimizer (Catalyst) works internally, and I don't think there is much
documentation on that.
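For illustration, a rough sketch of the generic-encoder route (the class is
hypothetical; assumes a Spark 2.0 SparkSession named spark). Note that the
objects are stored as opaque binary blobs, so you lose columnar storage and
column-level operations:

import org.apache.spark.sql.Encoders

class LegacyRecord(val id: Int, val payload: String) extends Serializable

// pass the encoder explicitly instead of relying on implicits
val ds = spark.createDataset(
  Seq(new LegacyRecord(1, "a"), new LegacyRecord(2, "b")))(
  Encoders.kryo[LegacyRecord])
ds.map(_.payload)(Encoders.STRING).show()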

regards,
--Jakob

On Mon, Aug 29, 2016 at 10:39 PM, canan chen  wrote:
>
> e.g. I have a custom class A (not case class), and I'd like to use it as
> DataSet[A]. I guess I need to implement Encoder for this, but didn't find
> any example for that, is there any document for that ? Thanks
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Error in Word Count Program

2016-07-19 Thread Jakob Odersky
Does the file /home/user/spark-1.5.1-bin-hadoop2.4/bin/README.md exist?

On Tue, Jul 19, 2016 at 4:30 AM, RK Spark  wrote:

> val textFile = sc.textFile("README.md")val linesWithSpark =
> textFile.filter(line => line.contains("Spark"))
> linesWithSpark.saveAsTextFile("output1")
>
>
> Same error:
>
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/home/user/spark-1.5.1-bin-hadoop2.4/bin/README.md
>


Re: I'm trying to understand how to compile Spark

2016-07-19 Thread Jakob Odersky
Hi Eli,

to build spark, just run

build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests package

in your source directory, where package is the actual word "package".
This will recompile the whole project, so it may take a while the first
time you run it.
Replacing a single file in an existing jar is not recommended unless it
is for a quick test, so I would also suggest giving your local Spark
compilation a custom version, so as to avoid any ambiguity if you depend
on it from somewhere else.
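One possible way to do that (a sketch only; it assumes the standard
versions-maven-plugin can be resolved, and the version string is just an
example):

build/mvn versions:set -DnewVersion=1.4.1-custom-SNAPSHOT
build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests install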

Check out this page
http://spark.apache.org/docs/1.4.1/building-spark.html for more
detailed information on the build process.

--jakob


On Tue, Jul 19, 2016 at 6:42 AM, Ted Yu  wrote:
> org.apache.spark.mllib.fpm is not a maven goal.
>
> -pl is For Individual Projects.
>
> Your first build action should not include -pl.
>
>
> On Tue, Jul 19, 2016 at 4:22 AM, Eli Super  wrote:
>>
>> Hi
>>
>> I have a windows laptop
>>
>> I just downloaded the spark 1.4.1 source code.
>>
>> I try to compile org.apache.spark.mllib.fpm with mvn
>>
>> My goal is to replace original org\apache\spark\mllib\fpm\* in
>> spark-assembly-1.4.1-hadoop2.6.0.jar
>>
>> As I understand from this link
>>
>>
>> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse
>>
>>
>> I need to execute following command : build/mvn package -DskipTests -pl
>> assembly
>> I executed : mvn org.apache.spark.mllib.fpm  -DskipTests -pl assembly
>>
>> Then I got an error
>>  [INFO] Scanning for projects...
>> [ERROR] [ERROR] Could not find the selected project in the reactor:
>> assembly @
>>
>> Thanks for any help
>>
>>
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: why spark 1.6 use Netty instead of Akka?

2016-05-23 Thread Jakob Odersky
Spark actually used to depend on Akka. Unfortunately this brought in
all of Akka's dependencies (in addition to Spark's already quite
complex dependency graph) and, as Todd mentioned, led to conflicts
with projects using both Spark and Akka.

It would probably be possible to use Akka and shade it to avoid
conflicts (some additional classloading tricks may be required).
However, considering that only a small portion of Akka's features was
used, and that it was scoped quite narrowly within Spark, it isn't worth
the extra maintenance burden. Furthermore, akka-remote uses Netty
internally, so reducing dependencies to the core functionality is a good
thing IMO.

On Mon, May 23, 2016 at 6:35 AM, Todd  wrote:
> As far as I know, there would be Akka version conflicting issue when  using
> Akka as spark streaming source.
>
>
>
>
>
>
> At 2016-05-23 21:19:08, "Chaoqiang"  wrote:
>>I want to know why spark 1.6 use Netty instead of Akka? Is there some
>>difficult problems which Akka can not solve, but using Netty can solve
>>easily?
>>If not, can you give me some references about this changing?
>>Thank you.
>>
>>
>>
>>--
>>View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/why-spark-1-6-use-Netty-instead-of-Akka-tp27004.html
>>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>-
>>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>For additional commands, e-mail: user-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: I want to unsubscribe

2016-04-05 Thread Jakob Odersky
to unsubscribe, send an email to user-unsubscr...@spark.apache.org

On Tue, Apr 5, 2016 at 4:50 PM, Ranjana Rajendran
 wrote:
> I get to see the threads in the public mailing list. I don;t want so many
> messages in my inbox. I want to unsubscribe.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Building spark submodule source code

2016-03-21 Thread Jakob Odersky
Another gotcha to watch out for are the SPARK_* environment variables.
Have you exported SPARK_HOME? If so, 'spark-shell' will use the Spark
installation the variable points to, regardless of where the script is
called from.
I.e. if SPARK_HOME points to a release version of Spark, your code
changes will never be available by simply running 'spark-shell'.
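A quick way to check and clear it for the current shell:

echo $SPARK_HOME   # if this prints a release directory, that is what spark-shell uses
unset SPARK_HOME
./bin/spark-shell  # run from your build directory to pick up your changes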

On Sun, Mar 20, 2016 at 11:23 PM, Akhil Das  wrote:
> Have a look at the intellij setup
> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ
> Once you have the setup ready, you don't have to recompile the whole stuff
> every time.
>
> Thanks
> Best Regards
>
> On Mon, Mar 21, 2016 at 8:14 AM, Tenghuan He  wrote:
>>
>> Hi everyone,
>>
>> I am trying to add a new method to spark RDD. After changing the code
>> of RDD.scala and running the following command
>> mvn -pl :spark-core_2.10 -DskipTests clean install
>> It BUILD SUCCESS, however, when starting the bin\spark-shell, my
>> method cannot be found.
>> Do I have to rebuild the whole spark project instead the spark-core
>> submodule to make the changes work?
>> Rebuiling the whole project is too time consuming, is there any better
>> choice?
>>
>>
>> Thanks & Best Regards
>>
>> Tenghuan He
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Can't zip RDDs with unequal numbers of partitions

2016-03-20 Thread Jakob Odersky
Can you share a snippet that reproduces the error? What was
spark.sql.autoBroadcastJoinThreshold before your last change?
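For reference, the exception itself is easy to reproduce in isolation with
plain RDDs (this only illustrates the error, not your specific SQL job):

val a = sc.parallelize(1 to 10, 4)
val b = sc.parallelize(1 to 10, 2)
a.zip(b).count() // IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions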

On Thu, Mar 17, 2016 at 10:03 AM, Jiří Syrový  wrote:
> Hi,
>
> any idea what could be causing this issue? It started appearing after
> changing parameter
>
> spark.sql.autoBroadcastJoinThreshold to 10
>
>
> Caused by: java.lang.IllegalArgumentException: Can't zip RDDs with unequal
> numbers of partitions
> at
> org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at
> org.apache.spark.rdd.PartitionCoalescer.(CoalescedRDD.scala:172)
> at
> org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:85)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.ShuffleDependency.(Dependency.scala:91)
> at
> org.apache.spark.sql.execution.Exchange.prepareShuffleDependency(Exchange.scala:220)
> at
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:254)
> at
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:248)
> at
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
> ... 28 more
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: ClassNotFoundException in RDD.map

2016-03-20 Thread Jakob Odersky
The error is very strange indeed; however, without code that reproduces
it, we can't really provide much help beyond speculation.

One thing that stood out to me immediately is that you say you have an
RDD of Any where every Any should be a BigDecimal, so why not specify
that type information?
When using Any, a whole class of errors that the typechecker could
normally catch can slip through.
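For illustration, a rough sketch (`scored` stands in for your
RDD[(Any, Double)]; the pattern match will fail fast with a MatchError if a
key is not actually a BigDecimal):

import org.apache.spark.rdd.RDD

val typed: RDD[(BigDecimal, Double)] = scored.map {
  case (id: BigDecimal, score) => (id, score)
}
typed.map(_._1).take(10) // the key type is now visible to the compiler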

On Thu, Mar 17, 2016 at 10:25 AM, Dirceu Semighini Filho
 wrote:
> Hi Ted, thanks for answering.
> The map is just that, whenever I try inside the map it throws this
> ClassNotFoundException, even if I do map(f => f) it throws the exception.
> What is bothering me is that when I do a take or a first it returns the
> result, which make me conclude that the previous code isn't wrong.
>
> Kind Regards,
> Dirceu
>
>
> 2016-03-17 12:50 GMT-03:00 Ted Yu :
>>
>> bq. $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
>>
>> Do you mind showing more of your code involving the map() ?
>>
>> On Thu, Mar 17, 2016 at 8:32 AM, Dirceu Semighini Filho
>>  wrote:
>>>
>>> Hello,
>>> I found a strange behavior after executing a prediction with MLIB.
>>> My code return an RDD[(Any,Double)] where Any is the id of my dataset,
>>> which is BigDecimal, and Double is the prediction for that line.
>>> When I run
>>> myRdd.take(10) it returns ok
>>> res16: Array[_ >: (Double, Double) <: (Any, Double)] =
>>> Array((1921821857196754403.00,0.1690292052496703),
>>> (454575632374427.00,0.16902820241892452),
>>> (989198096568001939.00,0.16903432789699502),
>>> (14284129652106187990.00,0.16903517653451386),
>>> (17980228074225252497.00,0.16903151028332508),
>>> (3861345958263692781.00,0.16903056986183976),
>>> (17558198701997383205.00,0.1690295450319745),
>>> (10651576092054552310.00,0.1690286445174418),
>>> (4534494349035056215.00,0.16903303401862327),
>>> (5551671513234217935.00,0.16902303368995966))
>>> But when I try to run some map on it:
>>> myRdd.map(_._1).take(10)
>>> It throws a ClassCastException:
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
>>> in stage 72.0 failed 4 times, most recent failure: Lost task 0.3 in stage
>>> 72.0 (TID 1774, 172.31.23.208): java.lang.ClassNotFoundException:
>>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>> at java.lang.Class.forName0(Native Method)
>>> at java.lang.Class.forName(Class.java:278)
>>> at
>>> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
>>> at
>>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
>>> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
>>> at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>> at
>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
>>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
>>> at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>> at
>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
>>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
>>> at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>> at
>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
>>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
>>> at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>> at
>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
>>> at
>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at
>>> 

Re: The error to read HDFS custom file in spark.

2016-03-19 Thread Jakob Odersky
Doesn't FileInputFormat require type parameters? Like so:

class RawDataInputFormat[LW <: LongWritable, RD <: RDRawDataRecord]
extends FileInputFormat[LW, RD]

I haven't verified this but it could be related to the compile error
you're getting.

On Thu, Mar 17, 2016 at 9:53 AM, Benyi Wang  wrote:
> I would say change
>
> class RawDataInputFormat[LW <: LongWritable, RD <: RDRawDataRecord] extends
> FileInputFormat
>
> to
>
> class RawDataInputFormat[LongWritable, RDRawDataRecord] extends
> FileInputFormat
>
>
> On Thu, Mar 17, 2016 at 9:48 AM, Mich Talebzadeh 
> wrote:
>>
>> Hi Tony,
>>
>> Is
>>
>> com.kiisoo.aegis.bd.common.hdfs.RDRawDataRecord
>>
>> One of your own packages?
>>
>> Sounds like it is one throwing the error
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>> On 17 March 2016 at 15:21, Tony Liu  wrote:
>>>
>>> Hi,
>>>My HDFS file is store with custom data structures. I want to read it
>>> with SparkContext object.So I define a formatting object:
>>>
>>> 1. code of RawDataInputFormat.scala
>>>
>>> import com.kiisoo.aegis.bd.common.hdfs.RDRawDataRecord
>>> import org.apache.hadoop.io.LongWritable
>>> import org.apache.hadoop.mapred._
>>>
>>> /**
>>>   * Created by Tony on 3/16/16.
>>>   */
>>> class RawDataInputFormat[LW <: LongWritable, RD <: RDRawDataRecord]
>>> extends FileInputFormat {
>>>
>>>   override def getRecordReader(split: InputSplit, job: JobConf, reporter:
>>> Reporter): RecordReader[LW, RD] = {
>>> new RawReader(split, job, reporter)
>>>   }
>>>
>>> }
>>>
>>> 2. code of RawReader.scala
>>>
>>> import com.kiisoo.aegis.bd.common.hdfs.RDRawDataRecord
>>> import org.apache.hadoop.io.{LongWritable, SequenceFile}
>>> import org.apache.hadoop.mapred._
>>>
>>> /**
>>>   * Created by Tony on 3/17/16.
>>>   */
>>> class RawReader[LW <: LongWritable, RD <: RDRawDataRecord] extends
>>> RecordReader[LW, RD] {
>>>
>>>   var reader: SequenceFile.Reader = null
>>>   var currentPos: Long = 0L
>>>   var length: Long = 0L
>>>
>>>   def this(split: InputSplit, job: JobConf, reporter: Reporter) {
>>> this()
>>> val p = (split.asInstanceOf[FileSplit]).getPath
>>> reader = new SequenceFile.Reader(job, SequenceFile.Reader.file(p))
>>>   }
>>>
>>>   override def next(key: LW, value: RD): Boolean = {
>>> val flag = reader.next(key, value)
>>> currentPos = reader.getPosition()
>>> flag
>>>   }
>>>
>>>   override def getProgress: Float = Math.min(1.0f, currentPos /
>>> length.toFloat)
>>>
>>>   override def getPos: Long = currentPos
>>>
>>>   override def createKey(): LongWritable = {
>>> new LongWritable()
>>>   }
>>>
>>>   override def close(): Unit = {
>>> reader.close()
>>>   }
>>>
>>>   override def createValue(): RDRawDataRecord = {
>>> new RDRawDataRecord()
>>>   }
>>> }
>>>
>>> 3. code of RDRawDataRecord.scala
>>>
>>> import com.kiisoo.aegis.common.rawdata.RawDataRecord;
>>> import java.io.DataInput;
>>> import java.io.DataOutput;
>>> import java.io.IOException;
>>> import org.apache.commons.lang.StringUtils;
>>> import org.apache.hadoop.io.Writable;
>>>
>>> public class RDRawDataRecord implements Writable {
>>> private String smac;
>>> private String dmac;
>>> private int hrssi;
>>> private int lrssi;
>>> private long fstamp;
>>> private long lstamp;
>>> private long maxstamp;
>>> private long minstamp;
>>> private long stamp;
>>>
>>> public void readFields(DataInput in) throws IOException {
>>> this.smac = in.readUTF();
>>> this.dmac = in.readUTF();
>>> this.hrssi = in.readInt();
>>> this.lrssi = in.readInt();
>>> this.fstamp = in.readLong();
>>> this.lstamp = in.readLong();
>>> this.maxstamp = in.readLong();
>>> this.minstamp = in.readLong();
>>> this.stamp = in.readLong();
>>> }
>>>
>>> public void write(DataOutput out) throws IOException {
>>> out.writeUTF(StringUtils.isNotBlank(this.smac)?this.smac:"");
>>> out.writeUTF(StringUtils.isNotBlank(this.dmac)?this.dmac:"");
>>> out.writeInt(this.hrssi);
>>> out.writeInt(this.lrssi);
>>> out.writeLong(this.fstamp);
>>> out.writeLong(this.lstamp);
>>> out.writeLong(this.maxstamp);
>>> out.writeLong(this.minstamp);
>>> out.writeLong(this.stamp);
>>> }
>>>
>>> /**
>>>
>>> ignore getter setter
>>>
>>> **/
>>>
>>> }
>>>
>>> At last, I use this code to run:
>>>
>>> val filePath =
>>> "hdfs://tony.Liu:9000/wifi-raw-data/wifi-raw-data.1455206402044"
>>> val conf = new SparkConf()
>>> conf.setMaster("local")
>>> conf.setAppName("demo")
>>> val sc = new SparkContext(conf)
>>> val file = sc.hadoopFile[LongWritable, RDRawDataRecord,
>>> RawDataInputFormat[LongWritable, 

Re: installing packages with pyspark

2016-03-19 Thread Jakob Odersky
Hi,
Regarding 1, packages are resolved locally. That means that when you
specify a package, spark-submit will resolve the dependencies and
download any jars on the local machine before shipping* them to the
cluster. So, without a priori knowledge of Dataproc clusters, specifying
packages there should work no differently.
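For example (the coordinates below are illustrative; substitute the
graphframes release matching your Spark version):

pyspark --packages graphframes:graphframes:0.1.0-spark1.6
spark-submit --packages graphframes:graphframes:0.1.0-spark1.6 my_job.py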

Unfortunately I can't help with 2.

--Jakob

*shipping in this case means making them available via the network

On Thu, Mar 17, 2016 at 5:36 PM, Ajinkya Kale  wrote:
> Hi all,
>
> I had couple of questions.
> 1. Is there documentation on how to add the graphframes or any other package
> for that matter on the google dataproc managed spark clusters ?
>
> 2. Is there a way to add a package to an existing pyspark context through a
> jupyter notebook ?
>
> --aj

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: installing packages with pyspark

2016-03-19 Thread Jakob Odersky
> But I guess I cannot add a package once i launch the pyspark context right ?

Correct. Potentially, if you really really wanted to, you could maybe
(with lots of pain) load packages dynamically with some class-loader
black magic, but Spark does not provide that functionality.

On Thu, Mar 17, 2016 at 7:20 PM, Ajinkya Kale <kaleajin...@gmail.com> wrote:
> Thanks Jakob, Felix. I am aware you can do it with --packages but i was
> wondering if there is a way to do something like "!pip install "
> like i do for other packages from jupyter notebook for python. But I guess I
> cannot add a package once i launch the pyspark context right ?
>
> On Thu, Mar 17, 2016 at 6:59 PM Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>>
>> For some, like graphframes that are Spark packages, you could also use
>> --packages in the command line of spark-submit or pyspark. See
>> http://spark.apache.org/docs/latest/submitting-applications.html
>>
>> _
>> From: Jakob Odersky <ja...@odersky.com>
>> Sent: Thursday, March 17, 2016 6:40 PM
>> Subject: Re: installing packages with pyspark
>> To: Ajinkya Kale <kaleajin...@gmail.com>
>> Cc: <user@spark.apache.org>
>>
>>
>> Hi,
>> regarding 1, packages are resolved locally. That means that when you
>> specify a package, spark-submit will resolve the dependencies and
>> download any jars on the local machine, before shipping* them to the
>> cluster. So, without a priori knowledge of dataproc clusters, it
>> should be no different to specify packages.
>>
>> Unfortunatly I can't help with 2.
>>
>> --Jakob
>>
>> *shipping in this case means making them available via the network
>>
>> On Thu, Mar 17, 2016 at 5:36 PM, Ajinkya Kale <kaleajin...@gmail.com>
>> wrote:
>> > Hi all,
>> >
>> > I had couple of questions.
>> > 1. Is there documentation on how to add the graphframes or any other
>> > package
>> > for that matter on the google dataproc managed spark clusters ?
>> >
>> > 2. Is there a way to add a package to an existing pyspark context
>> > through a
>> > jupyter notebook ?
>> >
>> > --aj
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Error building spark app with Maven

2016-03-15 Thread Jakob Odersky
The artifactId in Maven basically corresponds (in a simple case) to the name setting in sbt.

Note, however, that you will need to manually append the
_scalaBinaryVersion suffix to the artifactId if you would like to build
against multiple Scala versions (otherwise Maven will overwrite the
generated jar with the latest one).


On Tue, Mar 15, 2016 at 4:27 PM, Mich Talebzadeh
<mich.talebza...@gmail.com> wrote:
> ok  Ted
>
> In sbt I have
>
> name := "ImportCSV"
> version := "1.0"
> scalaVersion := "2.10.4"
>
> which ends up in importcsv_2.10-1.0.jar as part of
> target/scala-2.10/importcsv_2.10-1.0.jar
>
> In mvn I have
>
> 1.0
> scala
>
>
> Does it matter?
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
> On 15 March 2016 at 23:17, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> 1.0
>> ...
>> scala
>>
>> On Tue, Mar 15, 2016 at 4:14 PM, Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>>>
>>> An observation
>>>
>>> Once compiled with MVN the job submit works as follows:
>>>
>>> + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages
>>> com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master
>>> spark://50.140.197.217:7077 --executor-memory=12G --executor-cores=12
>>> --num-executors=2 target/scala-1.0.jar
>>>
>>> With sbt it takes this form
>>>
>>> + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages
>>> com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master
>>> spark://50.140.197.217:7077 --executor-memory=12G --executor-cores=12
>>> --num-executors=2 target/scala-2.10/importcsv_2.10-1.0.jar
>>>
>>> They both return the same results. However, why mvnjar file name is
>>> different (may be a naive question!)?
>>>
>>> thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>>
>>> On 15 March 2016 at 22:43, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>>
>>>> Many thanks Ted and thanks for heads up Jakob
>>>>
>>>> Just these two changes to dependencies
>>>>
>>>> 
>>>> org.apache.spark
>>>> spark-core_2.10
>>>> 1.5.1
>>>> 
>>>> 
>>>> org.apache.spark
>>>> spark-sql_2.10
>>>> 1.5.1
>>>> 
>>>>
>>>>
>>>> [DEBUG] endProcessChildren: artifact=spark:scala:jar:1.0
>>>> [INFO]
>>>> 
>>>> [INFO] BUILD SUCCESS
>>>> [INFO]
>>>> 
>>>> [INFO] Total time: 01:04 min
>>>> [INFO] Finished at: 2016-03-15T22:55:08+00:00
>>>> [INFO] Final Memory: 32M/1089M
>>>> [INFO]
>>>> 
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>>
>>>> On 15 March 2016 at 22:18, Jakob Odersky <ja...@odersky.com> wrote:
>>>>>
>>>>> Hi Mich,
>>>>> probably unrelated to the current error you're seeing, however the
>>>>> following dependencies will bite you later:
>>>>> spark-hive_2.10
>>>>> spark-csv_2.11
>>>>> the problem here is that you're using libraries built for different
>>>>> Scala binary versions (the numbers after the underscore). The simple
>>>>> fix here is to specify the Scala binary version you're project builds
>>>>> for (2.10 in your case, however note that version is EOL, you should
>>>>> upgrade to scala 2.11.8 if possible).
>>

Re: Error building spark app with Maven

2016-03-15 Thread Jakob Odersky
Hi Mich,
this is probably unrelated to the current error you're seeing; however,
the following dependencies will bite you later:
spark-hive_2.10
spark-csv_2.11
The problem here is that you're using libraries built for different
Scala binary versions (the numbers after the underscore). The simple
fix is to specify the Scala binary version your project builds
for (2.10 in your case; note, however, that this version is EOL, so you
should upgrade to Scala 2.11.8 if possible).

On a side note, sbt takes care of handling the correct Scala versions for
you (the double %% is actually shorthand for appending
"_scalaBinaryVersion" to your dependency). It also enables you to
build and publish your project seamlessly against multiple versions. I
would strongly recommend using it in Scala projects.
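For example, with scalaVersion := "2.10.4" the following two lines resolve
to spark-core_2.10 and spark-csv_2.10 respectively:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
libraryDependencies += "com.databricks"   %% "spark-csv"  % "1.3.0"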

cheers,
--Jakob



On Tue, Mar 15, 2016 at 3:08 PM, Mich Talebzadeh
 wrote:
> Hi,
>
> I normally use sbt and using this sbt file works fine for me
>
>  cat ImportCSV.sbt
> name := "ImportCSV"
> version := "1.0"
> scalaVersion := "2.10.4"
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
> libraryDependencies += "com.databricks" % "spark-csv_2.11" % "1.3.0"
>
> This is my first trial using Mavan and pom
>
>
> my pom.xml file looks like this but throws error at build
>
>
> [DEBUG]   com.univocity:univocity-parsers:jar:1.5.1:compile
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time: 1.326 s
> [INFO] Finished at: 2016-03-15T22:17:29+00:00
> [INFO] Final Memory: 14M/455M
> [INFO]
> 
> [ERROR] Failed to execute goal on project scala: Could not resolve
> dependencies for project spark:scala:jar:1.0: The following artifacts could
> not be resolved: org.apache.spark:spark-core:jar:1.5.1,
> org.apache.spark:spark-sql:jar:1.5.1: Failure to find
> org.apache.spark:spark-core:jar:1.5.1 in
> https://repo.maven.apache.org/maven2 was cached in the local repository,
> resolution will not be reattempted until the update interval of central has
> elapsed or updates are forced -> [Help 1]
>
>
> My pom file is
>
>
>  cat pom.xml
> http://maven.apache.org/POM/4.0.0;
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
> http://maven.apache.org/maven-v4_0_0.xsd;>
> 4.0.0
> spark
> 1.0
> ${project.artifactId}
>
> 
> 1.7
> 1.7
> UTF-8
> 2.10.4
> 2.15.2
> 
>
> 
>   
> org.scala-lang
> scala-library
> 2.10.2
>   
> 
> org.apache.spark
> spark-core
> 1.5.1
> 
> 
> org.apache.spark
> spark-sql
> 1.5.1
> 
> 
> org.apache.spark
> spark-hive_2.10
> 1.5.0
> 
> 
> com.databricks
> spark-csv_2.11
> 1.3.0
> 
> 
>
> 
> src/main/scala
> 
> 
> org.scala-tools
> maven-scala-plugin
> ${maven-scala-plugin.version}
> 
> 
> 
> compile
> 
> 
> 
> 
> 
> -Xms64m
> -Xmx1024m
> 
> 
> 
> 
> org.apache.maven.plugins
> maven-shade-plugin
> 1.6
> 
> 
> package
> 
> shade
> 
> 
> 
> 
> *:*
> 
> META-INF/*.SF
> META-INF/*.DSA
> META-INF/*.RSA
> 
> 
> 
> 
>  implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
> com.group.id.Launcher1
> 
> 
> 
> 
> 
> 
> 
> 
>
> scala
> 
>
>
> I am sure I have omitted something?
>
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to distribute dependent files (.so , jar ) across spark worker nodes

2016-03-14 Thread Jakob Odersky
Have you tried setting the configuration
`spark.executor.extraLibraryPath` to point to a location where your
.so's are available? (Not sure if non-local files, such as HDFS, are
supported)
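For example (the library path is hypothetical; the .so files must already be
present at that location on every worker node):

spark-submit \
  --conf spark.executor.extraLibraryPath=/opt/native/libs \
  --class com.example.MyApp my-app.jar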

On Mon, Mar 14, 2016 at 2:12 PM, Tristan Nixon  wrote:
> What build system are you using to compile your code?
> If you use a dependency management system like maven or sbt, then you should 
> be able to instruct it to build a single jar that contains all the other 
> dependencies, including third-party jars and .so’s. I am a maven user myself, 
> and I use the shade plugin for this:
> https://maven.apache.org/plugins/maven-shade-plugin/
>
> However, if you are using SBT or another dependency manager, someone else on 
> this list may be able to give you help on that.
>
> If you’re not using a dependency manager - well, you should be. Trying to 
> manage this manually is a pain that you do not want to get in the way of your 
> project. There are perfectly good tools to do this for you; use them.
>
>> On Mar 14, 2016, at 3:56 PM, prateek arora  
>> wrote:
>>
>> Hi
>>
>> Thanks for the information .
>>
>> but my problem is that if i want to write spark application which depend on
>> third party libraries like opencv then whats is the best approach to
>> distribute all .so and jar file of opencv in all cluster ?
>>
>> Regards
>> Prateek
>>
>>
>>
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-distribute-dependent-files-so-jar-across-spark-worker-nodes-tp26464p26489.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Installing Spark on Mac

2016-03-11 Thread Jakob Odersky
Regarding my previous message, I forgot to mention that netstat should be
run as root (sudo netstat -plunt).
Sorry for the noise.

On Fri, Mar 11, 2016 at 12:29 AM, Jakob Odersky <ja...@odersky.com> wrote:
> Some more diagnostics/suggestions:
>
> 1) are other services listening to ports in the 4000 range (run
> "netstat -plunt")? Maybe there is an issue with the error message
> itself.
>
> 2) are you sure the correct java version is used? java -version
>
> 3) can you revert all installation attempts you have done so far,
> including files installed by brew/macports or maven and try again?
>
> 4) are there any special firewall rules in place, forbidding
> connections on localhost?
>
> This is very weird behavior you're seeing. Spark is supposed to work
> out-of-the-box with ZERO configuration necessary for running a local
> shell. Again, my prime suspect is a previous, failed Spark
> installation messing up your config.
>
> On Thu, Mar 10, 2016 at 12:24 PM, Tristan Nixon <st...@memeticlabs.org> wrote:
>> If you type ‘whoami’ in the terminal, and it responds with ‘root’ then 
>> you’re the superuser.
>> However, as mentioned below, I don’t think its a relevant factor.
>>
>>> On Mar 10, 2016, at 12:02 PM, Aida Tefera <aida1.tef...@gmail.com> wrote:
>>>
>>> Hi Tristan,
>>>
>>> I'm afraid I wouldn't know whether I'm running it as super user.
>>>
>>> I have java version 1.8.0_73 and SCALA version 2.11.7
>>>
>>> Sent from my iPhone
>>>
>>>> On 9 Mar 2016, at 21:58, Tristan Nixon <st...@memeticlabs.org> wrote:
>>>>
>>>> That’s very strange. I just un-set my SPARK_HOME env param, downloaded a 
>>>> fresh 1.6.0 tarball,
>>>> unzipped it to local dir (~/Downloads), and it ran just fine - the driver 
>>>> port is some randomly generated large number.
>>>> So SPARK_HOME is definitely not needed to run this.
>>>>
>>>> Aida, you are not running this as the super-user, are you?  What versions 
>>>> of Java & Scala do you have installed?
>>>>
>>>>> On Mar 9, 2016, at 3:53 PM, Aida Tefera <aida1.tef...@gmail.com> wrote:
>>>>>
>>>>> Hi Jakob,
>>>>>
>>>>> Tried running the command env|grep SPARK; nothing comes back
>>>>>
>>>>> Tried env|grep Spark; which is the directory I created for Spark once I 
>>>>> downloaded the tgz file; comes back with PWD=/Users/aidatefera/Spark
>>>>>
>>>>> Tried running ./bin/spark-shell ; comes back with same error as below; 
>>>>> i.e could not bind to port 0 etc.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>>> On 9 Mar 2016, at 21:42, Jakob Odersky <ja...@odersky.com> wrote:
>>>>>>
>>>>>> As Tristan mentioned, it looks as though Spark is trying to bind on
>>>>>> port 0 and then 1 (which is not allowed). Could it be that some
>>>>>> environment variables from you previous installation attempts are
>>>>>> polluting your configuration?
>>>>>> What does running "env | grep SPARK" show you?
>>>>>>
>>>>>> Also, try running just "/bin/spark-shell" (without the --master
>>>>>> argument), maybe your shell is doing some funky stuff with the
>>>>>> brackets.
>>>>>
>>>>> -
>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>
> On Thu, Mar 10, 2016 at 12:24 PM, Tristan Nixon <st...@memeticlabs.org> wrote:
>> If you type ‘whoami’ in the terminal, and it responds with ‘root’ then 
>> you’re the superuser.
>> However, as mentioned below, I don’t think its a relevant factor.
>>
>>> On Mar 10, 2016, at 12:02 PM, Aida Tefera <aida1.tef...@gmail.com> wrote:
>>>
>>> Hi Tristan,
>>>
>>> I'm afraid I wouldn't know whether I'm running it as super user.
>>>
>>> I have java version 1.8.0_73 and SCALA version 2.11.7
>>>
>>> Sent from my iPhone
>>>
>>>

Re: Installing Spark on Mac

2016-03-11 Thread Jakob Odersky
Some more diagnostics/suggestions:

1) are other services listening to ports in the 4000 range (run
"netstat -plunt")? Maybe there is an issue with the error message
itself.

2) are you sure the correct java version is used? java -version

3) can you revert all installation attempts you have done so far,
including files installed by brew/macports or maven and try again?

4) are there any special firewall rules in place, forbidding
connections on localhost?

This is very weird behavior you're seeing. Spark is supposed to work
out-of-the-box with ZERO configuration necessary for running a local
shell. Again, my prime suspect is a previous, failed Spark
installation messing up your config.

On Thu, Mar 10, 2016 at 12:24 PM, Tristan Nixon <st...@memeticlabs.org> wrote:
> If you type ‘whoami’ in the terminal, and it responds with ‘root’ then you’re 
> the superuser.
> However, as mentioned below, I don’t think its a relevant factor.
>
>> On Mar 10, 2016, at 12:02 PM, Aida Tefera <aida1.tef...@gmail.com> wrote:
>>
>> Hi Tristan,
>>
>> I'm afraid I wouldn't know whether I'm running it as super user.
>>
>> I have java version 1.8.0_73 and SCALA version 2.11.7
>>
>> Sent from my iPhone
>>
>>> On 9 Mar 2016, at 21:58, Tristan Nixon <st...@memeticlabs.org> wrote:
>>>
>>> That’s very strange. I just un-set my SPARK_HOME env param, downloaded a 
>>> fresh 1.6.0 tarball,
>>> unzipped it to local dir (~/Downloads), and it ran just fine - the driver 
>>> port is some randomly generated large number.
>>> So SPARK_HOME is definitely not needed to run this.
>>>
>>> Aida, you are not running this as the super-user, are you?  What versions 
>>> of Java & Scala do you have installed?
>>>
>>>> On Mar 9, 2016, at 3:53 PM, Aida Tefera <aida1.tef...@gmail.com> wrote:
>>>>
>>>> Hi Jakob,
>>>>
>>>> Tried running the command env|grep SPARK; nothing comes back
>>>>
>>>> Tried env|grep Spark; which is the directory I created for Spark once I 
>>>> downloaded the tgz file; comes back with PWD=/Users/aidatefera/Spark
>>>>
>>>> Tried running ./bin/spark-shell ; comes back with same error as below; i.e 
>>>> could not bind to port 0 etc.
>>>>
>>>> Sent from my iPhone
>>>>
>>>>> On 9 Mar 2016, at 21:42, Jakob Odersky <ja...@odersky.com> wrote:
>>>>>
>>>>> As Tristan mentioned, it looks as though Spark is trying to bind on
>>>>> port 0 and then 1 (which is not allowed). Could it be that some
>>>>> environment variables from you previous installation attempts are
>>>>> polluting your configuration?
>>>>> What does running "env | grep SPARK" show you?
>>>>>
>>>>> Also, try running just "/bin/spark-shell" (without the --master
>>>>> argument), maybe your shell is doing some funky stuff with the
>>>>> brackets.
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

On Thu, Mar 10, 2016 at 12:24 PM, Tristan Nixon <st...@memeticlabs.org> wrote:
> If you type ‘whoami’ in the terminal, and it responds with ‘root’ then you’re 
> the superuser.
> However, as mentioned below, I don’t think its a relevant factor.
>
>> On Mar 10, 2016, at 12:02 PM, Aida Tefera <aida1.tef...@gmail.com> wrote:
>>
>> Hi Tristan,
>>
>> I'm afraid I wouldn't know whether I'm running it as super user.
>>
>> I have java version 1.8.0_73 and SCALA version 2.11.7
>>
>> Sent from my iPhone
>>
>>> On 9 Mar 2016, at 21:58, Tristan Nixon <st...@memeticlabs.org> wrote:
>>>
>>> That’s very strange. I just un-set my SPARK_HOME env param, downloaded a 
>>> fresh 1.6.0 tarball,
>>> unzipped it to local dir (~/Downloads), and it ran just fine - the driver 
>>> port is some randomly generated large number.
>>> So SPARK_HOME is definitely not needed to run this.
>>>
>>> Aida, you are not running this as the super-user, are you?  What versions 
>>> of Java & Scala do you have installed?
>>>
>>>

Re: Installing Spark on Mac

2016-03-09 Thread Jakob Odersky
Sorry, I had a typo in my previous message:

> try running just "/bin/spark-shell"

Please remove the leading slash (/).

On Wed, Mar 9, 2016 at 1:39 PM, Aida Tefera  wrote:
> Hi there, tried echo $SPARK_HOME but nothing comes back so I guess I need to
> set it. How would I do that?
>
> Thanks
>
> Sent from my iPhone
>
> On 9 Mar 2016, at 21:35, Tristan Nixon  wrote:
>
> No, those look like the right directions… It *should* work, but clearly is
> not. Hmmm…
>
> You can check if the spark home is set by:
> echo $SPARK_HOME
>
> but that doesn’t seem to be the issue.
>
> On Mar 9, 2016, at 2:58 PM, Aida Tefera  wrote:
>
> Hi Tristan, my apologies, I meant to write Spark and not SCALA
>
> I feel a bit lost at the moment...
>
> Perhaps I have missed steps that are implicit to more experienced people
>
> Apart from downloading spark and then following Jakob's steps:
>
> 1.
> curl http://apache.arvixe.com/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
> -O
>
> 2. tar -xzf spark-1.6.0-bin-hadoop2.6.tgz
>
> 3. cd spark-1.6.0-bin-hadoop2.6
>
> 4. ./bin/spark-shell --master local[2]
>
>
> Was I supposed to do something additional to this?
>
> How would I be able to determine if the spark_home is looking into a
> different directory?
>
>
> Sent from my iPhone
>
> On 9 Mar 2016, at 20:39, Tristan Nixon  wrote:
>
>
> SPARK_HOME and SCALA_HOME are different. I was just wondering whether spark
> is looking in a different dir for the config files than where you’re running
> it. If you have not set SPARK_HOME, it should look in the current directory
> for the /conf dir.
>
>
> The defaults should be relatively safe, I’ve been using them with local mode
> on my Mac for a long while without any need to change them.
>
>
> On Mar 9, 2016, at 2:20 PM, Aida Tefera  wrote:
>
>
> I don't think I set the SCALA_HOME environment variable
>
>
> Also, I'm unsure whether or not I should launch the scripts defaults to a
> single machine(local host)
>
>
> Sent from my iPhone
>
>
> On 9 Mar 2016, at 19:59, Tristan Nixon  wrote:
>
>
> Also, do you have the SPARK_HOME environment variable set in your shell, and
> if so what is it set to?
>
>
> On Mar 9, 2016, at 1:53 PM, Tristan Nixon  wrote:
>
>
> There should be a /conf sub-directory wherever you installed spark, which
> contains several configuration files.
>
> I believe that the two that you should look at are
>
> spark-defaults.conf
>
> spark-env.sh
>
>
>
> On Mar 9, 2016, at 1:45 PM, Aida Tefera  wrote:
>
>
> Hi Tristan, thanks for your message
>
>
> When I look at the spark-defaults.conf.template it shows a spark
> example(spark://master:7077) where the port is 7077
>
>
> When you say look to the conf scripts, how do you mean?
>
>
> Sent from my iPhone
>
>
> On 9 Mar 2016, at 19:32, Tristan Nixon  wrote:
>
>
> Yeah, according to the standalone documentation
>
> http://spark.apache.org/docs/latest/spark-standalone.html
>
>
> the default port should be 7077, which means that something must be
> overriding this on your installation - look to the conf scripts!
>
>
> On Mar 9, 2016, at 1:26 PM, Tristan Nixon  wrote:
>
>
> Looks like it’s trying to bind on port 0, then 1.
>
> Often the low-numbered ports are restricted to system processes and
> “established” servers (web, ssh, etc.) and
>
> so user programs are prevented from binding on them. The default should be
> to run on a high-numbered port like 8080 or such.
>
>
> What do you have in your spark-env.sh?
>
>
> On Mar 9, 2016, at 12:35 PM, Aida  wrote:
>
>
> Hi everyone, thanks for all your support
>
>
> I went with your suggestion Cody/Jakob and downloaded a pre-built version
>
> with Hadoop this time and I think I am finally making some progress :)
>
>
>
> ukdrfs01:spark-1.6.0-bin-hadoop2.6 aidatefera$ ./bin/spark-shell --master
>
> local[2]
>
> log4j:WARN No appenders could be found for logger
>
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
>
> log4j:WARN Please initialize the log4j system properly.
>
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
>
> more info.
>
> Using Spark's repl log4j profile:
>
> org/apache/spark/log4j-defaults-repl.properties
>
> To adjust logging level use sc.setLogLevel("INFO")
>
> Welcome to
>
>   __
>
> / __/__  ___ _/ /__
>
> _\ \/ _ \/ _ `/ __/  '_/
>
> /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
>
> /_/
>
>
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java
>
> 1.8.0_73)
>
> Type in expressions to have them evaluated.
>
> Type :help for more information.
>
> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>
> 0. Attempting port 1.
>
> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>
> 0. Attempting port 1.
>
> 16/03/09 

Re: Installing Spark on Mac

2016-03-09 Thread Jakob Odersky
As Tristan mentioned, it looks as though Spark is trying to bind on
port 0 and then 1 (which is not allowed). Could it be that some
environment variables from your previous installation attempts are
polluting your configuration?
What does running "env | grep SPARK" show you?

Also, try running just "/bin/spark-shell" (without the --master
argument), maybe your shell is doing some funky stuff with the
brackets.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Installing Spark on Mac

2016-03-08 Thread Jakob Odersky
I've had some issues myself with the user-provided Hadoop version.
If you simply just want to get started, I would recommend downloading
Spark (pre-built, with any of the hadoop versions) as Cody suggested.

A simple step-by-step guide:

1. curl http://apache.arvixe.com/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
-O

2. tar -xzf spark-1.6.0-bin-hadoop2.6.tgz

3. cd spark-1.6.0-bin-hadoop2.6

4. ./bin/spark-shell --master local[2]

On Tue, Mar 8, 2016 at 2:01 PM, Aida Tefera  wrote:
> Ok, once I downloaded the pre built version, I created a directory for it and 
> named Spark
>
> When I try ./bin/start-all.sh
>
> It comes back with : no such file or directory
>
> When I try ./bin/spark-shell --master local[2]
>
> I get: no such file or directory
> Failed to find spark assembly, you need to build Spark before running this 
> program
>
>
>
> Sent from my iPhone
>
>> On 8 Mar 2016, at 21:50, Cody Koeninger  wrote:
>>
>> That's what I'm saying, there is no "installing" necessary for
>> pre-built packages.  Just unpack it and change directory into it.
>>
>> What happens when you do
>>
>> ./bin/spark-shell --master local[2]
>>
>> or
>>
>> ./bin/start-all.sh
>>
>>
>>
>>> On Tue, Mar 8, 2016 at 3:45 PM, Aida Tefera  wrote:
>>> Hi Cody, thanks for your reply
>>>
>>> I tried "sbt/sbt clean assembly" in the Terminal; somehow I still end up 
>>> with errors.
>>>
>>> I have looked at the below links, doesn't give much detail on how to 
>>> install it before executing "./sbin/start-master.sh"
>>>
>>> Thanks,
>>>
>>> Aida
>>> Sent from my iPhone
>>>
 On 8 Mar 2016, at 19:02, Cody Koeninger  wrote:

 You said you downloaded a prebuilt version.

 You shouldn't have to mess with maven or building spark at all.  All
 you need is a jvm, which it looks like you already have installed.

 You should be able to follow the instructions at

 http://spark.apache.org/docs/latest/

 and

 http://spark.apache.org/docs/latest/spark-standalone.html

 If you want standalone mode (master and several worker processes on
 your machine) rather than local mode (single process on your machine),
 you need to set up passwordless ssh to localhost

 http://stackoverflow.com/questions/7134535/setup-passphraseless-ssh-to-localhost-on-os-x



 On Tue, Mar 8, 2016 at 12:45 PM, Eduardo Costa Alfaia
  wrote:
> Hi Aida,
> The installation has detected a maven version 3.0.3. Update to 3.3.3 and 
> try
> again.
>
> Il 08/Mar/2016 14:06, "Aida"  ha scritto:
>>
>> Hi all,
>>
>> Thanks everyone for your responses; really appreciate it.
>>
>> Eduardo - I tried your suggestions but ran into some issues, please see
>> below:
>>
>> ukdrfs01:Spark aidatefera$ cd spark-1.6.0
>> ukdrfs01:spark-1.6.0 aidatefera$ build/mvn -DskipTests clean package
>> Using `mvn` from path: /usr/bin/mvn
>> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
>> MaxPermSize=512M;
>> support was removed in 8.0
>> [INFO] Scanning for projects...
>> [INFO]
>> 
>> [INFO] Reactor Build Order:
>> [INFO]
>> [INFO] Spark Project Parent POM
>> [INFO] Spark Project Test Tags
>> [INFO] Spark Project Launcher
>> [INFO] Spark Project Networking
>> [INFO] Spark Project Shuffle Streaming Service
>> [INFO] Spark Project Unsafe
>> [INFO] Spark Project Core
>> [INFO] Spark Project Bagel
>> [INFO] Spark Project GraphX
>> [INFO] Spark Project Streaming
>> [INFO] Spark Project Catalyst
>> [INFO] Spark Project SQL
>> [INFO] Spark Project ML Library
>> [INFO] Spark Project Tools
>> [INFO] Spark Project Hive
>> [INFO] Spark Project Docker Integration Tests
>> [INFO] Spark Project REPL
>> [INFO] Spark Project Assembly
>> [INFO] Spark Project External Twitter
>> [INFO] Spark Project External Flume Sink
>> [INFO] Spark Project External Flume
>> [INFO] Spark Project External Flume Assembly
>> [INFO] Spark Project External MQTT
>> [INFO] Spark Project External MQTT Assembly
>> [INFO] Spark Project External ZeroMQ
>> [INFO] Spark Project External Kafka
>> [INFO] Spark Project Examples
>> [INFO] Spark Project External Kafka Assembly
>> [INFO]
>> [INFO]
>> 
>> [INFO] Building Spark Project Parent POM 1.6.0
>> [INFO]
>> 
>> [INFO]
>> [INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @
>> spark-parent_2.10 ---
>> [INFO]
>> [INFO] --- maven-enforcer-plugin:1.4:enforce (enforce-versions) @
>> 

Re: How could I do this algorithm in Spark?

2016-02-24 Thread Jakob Odersky
Hi Guillermo,
assuming that the first "a,b" is a typo and you actually meant "a,d",
this is a sorting problem.

You could easily model your data as an RDD of tuples (or as a
DataFrame/Dataset) and use the sortBy (or orderBy for DataFrames/Datasets)
methods.

best,
--Jakob

On Wed, Feb 24, 2016 at 2:26 PM, Guillermo Ortiz  wrote:
> I want to do some algorithm in Spark.. I know how to do it in a single
> machine where all data are together, but I don't know a good way to do it in
> Spark.
>
> If someone has an idea..
> I have some data like this
> a , b
> x , y
> b , c
> y , y
> c , d
>
> I want something like:
> a , d
> b , d
> c , d
> x , y
> y , y
>
> I need to know that a->b->c->d, so a->d, b->d and c->d.
> I don't want the code, just an idea how I could deal with it.
>
> Any idea?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to delete a record from parquet files using dataframes

2016-02-24 Thread Jakob Odersky
You can `filter` (see the DataFrame scaladoc) your DataFrames before saving
them to, or after reading them from, Parquet files.
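
For example, a minimal sketch (assuming Spark 1.6 with an existing SQLContext;
the paths and the `id` column are hypothetical):

val df = sqlContext.read.parquet("hdfs:///data/events")
// keep everything except the rows to "delete", then write a new file
val kept = df.filter("id != 42")
kept.write.parquet("hdfs:///data/events_cleaned")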

On Wed, Feb 24, 2016 at 1:28 AM, Cheng Lian  wrote:

> Parquet is a read-only format. So the only way to remove data from a
> written Parquet file is to write a new Parquet file without unwanted rows.
>
> Cheng
>
>
> On 2/17/16 5:11 AM, SRK wrote:
>
>> Hi,
>>
>> I am saving my records in the form of parquet files using dataframes in
>> hdfs. How to delete the records using dataframes?
>>
>> Thanks!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-delete-a-record-from-parquet-files-using-dataframes-tp26242.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: SparkMaster IP

2016-02-22 Thread Jakob Odersky
Spark master by default binds to whatever ip address your current host
resolves to. You have a few options to change that:
- override the ip by setting the environment variable SPARK_LOCAL_IP
- change the ip in your local "hosts" file (/etc/hosts on linux, not
sure on windows)
- specify a different hostname such as "localhost" when starting spark
master by passing the "--host HOSTNAME" command-line parameter (the ip
address will be resolved from the supplied HOSTNAME)

best,
--Jakob

On Mon, Feb 22, 2016 at 5:09 PM, Arko Provo Mukherjee
 wrote:
> Hello,
>
> I am running Spark on Windows.
>
> I start up master as follows:
> .\spark-class.cmd org.apache.spark.deploy.master.Master
>
> I see that the SparkMaster doesn't start on 127.0.0.1 but starts on my
> "actual" IP. This is troublesome for me as I use it in my code and
> need to change every time I restart.
>
> Is there a way to make SparkMaster listen to 127.0.0.1:7077?
>
> Thanks much in advace!
> Warm regards
> Arko
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Option[Long] parameter in case class parsed from JSON DataFrame failing when key not present in JSON

2016-02-22 Thread Jakob Odersky
I think the issue is that the `read.json` function has no idea of the
underlying schema; in fact the documentation
(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader)
says:

> Unless the schema is specified using schema function, this function goes 
> through the input once to determine the input schema.

so since your test data does not contain a record with a product_id,
json.read creates a schema that does not contain it. Only after
determining the (incorrect) schema, you treat it as a Dataset of
CustomerEvent which will fail.
Try creating a schema (StructType) manually for your CustomerEvent
case class and pass it to the reader's `schema` function before calling
`json`. I.e. something like

import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

val sch = StructType(Seq(
  StructField("customer_id", StringType, false),
  StructField("product_id", LongType, true))) //there's probably a better
//way to get the schema from a case class
val customers: Dataset[CustomerEvent] =
sqlContext.read.schema(sch).json(rdd).as[CustomerEvent]

just a pointer, I haven't tested this.
regards,
--Jakob

On Mon, Feb 22, 2016 at 12:17 PM, Jorge Machado  wrote:
> Hi Anthony,
>
> I tried the code myself. I think the problem is in the jsonStr:
>
> I do it with : val jsonStr = """{"customer_id":
> "3ee066ab571e03dd5f3c443a6c34417a","product_id": 3}”""
>
> or is it the “,” after your 3, or the “\n”?
>
> Regards
>
>
>
> On 22/02/2016, at 15:42, Anthony Brew  wrote:
>
> Hi,
>  I'm trying to parse JSON data into a case class using the
> DataFrame.as[] function, but I am hitting an unusual error and the interweb
> isn't solving my pain so thought I would reach out for help. I've truncated my
> code a little here to make it readable, but the error is full
>
> My case class looks like
>
> case class CustomerEvent(
>   customer_id: String,
>   product_id: Option[Long] = None,
> )
>
>
> My passing test looks like
>
> "A Full CustomerEvent JSON Object" should "Parse Correctly" in {
>   val jsonStr = """ {
>  "customer_id": "3ee066ab571e03dd5f3c443a6c34417a",
>  "product_id": 3,
> }
>  """
>// apparently deprecation is not an issue
>val rdd = sc.parallelize(Seq(jsonStr))
>
>import sqlContext.implicits._
>val customers: Dataset[CustomerEvent] =
> sqlContext.read.json(rdd).as[CustomerEvent]
>
>val ce: CustomerEvent = customers.first()
>ce.customer_id should be ("3ee066ab571e03dd5f3c443a6c34417a")
>ce.product_id.get should be (3)
>  }
>
> My issue is when the product_id is not part of the json, I get an encoding
> error
>
> ie the following
>
>   "A Partial CustomerEvent JSON Object" should " should Parse Correctly" in
> {
> val jsonStr = """ {
>"customer_id": "3ee066ab571e03dd5f3c443a6c34417a"
>   }
>   """
> // apparently deprecation is not an issue
> val rdd = sc.parallelize(Seq(jsonStr))
>
> import sqlContext.implicits._
> val customers: Dataset[CustomerEvent] =
> sqlContext.read.json(rdd).as[CustomerEvent]
>
> val ce: CustomerEvent = customers.first()
> ce.customer_id should be ("3ee066ab571e03dd5f3c443a6c34417a")
> ce.product_id.isDefined should be (false)
>
>   }
>
>
>
> My error looks like
>
> Error while decoding: java.lang.UnsupportedOperationException: Cannot
> evaluate expression: upcast('product_id,DoubleType,- field (class:
> "scala.Option", name: "product_id"),- root class: "data.CustomerEvent")
> newinstance(class data.CustomerEvent,invoke(input[3,
> StringType],toString,ObjectType(class java.lang.String)),input[0,
> LongType],input[9, LongType],invoke(input[5,
> StringType],toString,ObjectType(class java.lang.String)),invoke(input[6,
> StringType],toString,ObjectType(class java.lang.String)),input[7,
> LongType],invoke(input[1, StringType],toString,ObjectType(class
> java.lang.String)),wrapoption(input[8,
> LongType]),wrapoption(upcast('product_id,DoubleType,- field (class:
> "scala.Option", name: "product_id"),- root class:
> "data.CustomerEvent")),wrapoption(input[4,
> DoubleType]),wrapoption(invoke(input[2,
> StringType],toString,ObjectType(class
> java.lang.String))),false,ObjectType(class data.CustomerEvent),None)
> :- invoke(input[3, StringType],toString,ObjectType(class java.lang.String))
> :  +- input[3, StringType]
> :- input[0, LongType]
> :- input[9, LongType]
> :- invoke(input[5, StringType],toString,ObjectType(class java.lang.String))
> :  +- input[5, StringType]
> :- invoke(input[6, StringType],toString,ObjectType(class java.lang.String))
> :  +- input[6, StringType]
> :- input[7, LongType]
> :- invoke(input[1, StringType],toString,ObjectType(class java.lang.String))
> :  +- input[1, StringType]
> :- wrapoption(input[8, LongType])
> :  +- input[8, LongType]
> :- wrapoption(upcast('product_id,DoubleType,- field (class: "scala.Option",
> name: 

Re: How to parallel read files in a directory

2016-02-11 Thread Jakob Odersky
Hi Junjie,

How do you access the files currently? Have you considered using hdfs? It's
designed to be distributed across a cluster and Spark has built-in support.
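
For example, a minimal sketch, assuming the files have been copied to a
hypothetical HDFS directory reachable from all workers:

// each worker reads its own partitions in parallel
val lines = sc.textFile("hdfs:///data/input/*.csv", minPartitions = 64)
println(lines.count())

// for many small files, wholeTextFiles returns (path, content) pairs instead
val files = sc.wholeTextFiles("hdfs:///data/input")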

Best,
--Jakob
On Feb 11, 2016 9:33 AM, "Junjie Qian"  wrote:

> Hi all,
>
> I am working with Spark 1.6, scala and have a big dataset divided into
> several small files.
>
> My question is: right now the read operation takes really long time and
> often has RDD warnings. Is there a way I can read the files in parallel,
> that all nodes or workers read the file at the same time?
>
> Many thanks
> Junjie
>


Re: retrieving all the rows with collect()

2016-02-10 Thread Jakob Odersky
Hi Mich,
your assumptions 1 to 3 are all correct (nitpick: they're method
*calls*, the methods being the part before the parentheses, but I
assume that's what you meant). The last one is also a method call but
uses syntactic sugar on top: `foreach(println)` boils down to
`foreach(line => println(line))`.

On an unrelated side-note, I would suggest you add a period between
every method call, it makes things easier to read and is actually
required in certain circumstances. Specifically I would add a period
before collect() and foreach().
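
For example, the expression from above written with explicit dots between the
method calls:

errlog.filter(line => line.contains("sed")).collect().foreach(println)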

best,
--Jakob

On Wed, Feb 10, 2016 at 2:35 PM, Mich Talebzadeh
 wrote:
>
>
> Hi Chandeep
>
>
>
> Many thanks for your help
>
>
>
> In the line below
>
>
>
> errlog.filter(line => line.contains("sed"))collect()foreach(println)
>
>
>
> Can you please clarify the components with the correct naming as I am new to
> Scala
>
> errlog   --> is the RDD?
> filter(line => line.contains("sed")) is a method
> collect()  is another method ?
> foreach (println) ?
>
>
>
> Thanks
>
>
>
> On 10/02/2016 21:28, Chandeep Singh wrote:
>
> Hi Mich,
>
> If you would like to print everything to the console you could -
> errlog.filter(line => line.contains("sed"))collect()foreach(println)
>
> or you could always save to a file using any of the saveAs methods.
>
> Thanks,
> Chandeep
>
> On Wed, Feb 10, 2016 at 8:14 PM,
>  wrote:
>>
>>
>>
>> Hi,
>>
>> I have a bunch of files stored in hdfs /unit_files directory in total 319
>> files
>>
>> scala> val errlog = sc.textFile("/unix_files/*.ksh")
>>
>> scala>  errlog.filter(line => line.contains("sed"))count()
>> res104: Long = 1113
>>
>> So it returns 1113 instances the word "sed"
>>
>> If I want to see the collection I can do
>>
>>
>> scala>  errlog.filter(line => line.contains("sed"))collect()
>>
>> res105: Array[String] = Array(" DSQUERY=${1} ;
>> DBNAME=${2} ; ERROR=0 ; PROGNAME=$(basename $0 | sed -e s/.ksh//)", #.
>> in environment based on argument for script., "   exec sp_spaceused", "
>> exec sp_spaceused", PROGNAME=$(basename $0 | sed -e s/.ksh//), "
>> BACKUPSERVER=$5# Server that is used to load the transaction dump",
>> "BACKUPSERVER=$5 # Server that is used to load the
>> transaction dump", "BACKUPSERVER=$5 # Server that is used to
>> load the transaction dump", "cat $TMPDIR/${DBNAME}_trandump.sql | sed
>> s/${DSQUERY}/${REMOTESERVER}/ > $TMPDIR/${DBNAME}_trandump.tmpsql", cat
>> $TMPDIR/${DBNAME}_tran_transfer.sql | sed s/${DSQUERY}/${REMOTESERVER}/ >
>> $TMPDIR/${DBNAME}_tran_transfer.tmpsql, PROGNAME=$(basename $0 | sed -e
>> s/.ksh//), "B...
>> scala>
>>
>>
>> Now is there anyway I can retrieve all these instances or perhaps they are
>> all wrapped up and I only see few lines?
>>
>> Thanks,
>>
>> Mich
>
>
>
>
>
> --
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Cloud Technology
> Partners Ltd, its subsidiaries or their employees, unless expressly so
> stated. It is the responsibility of the recipient to ensure that this email
> is virus free, therefore neither Cloud Technology partners Ltd, its
> subsidiaries nor their employees accept any responsibility.
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: retrieving all the rows with collect()

2016-02-10 Thread Jakob Odersky
Exactly!
As a final note, `foreach` is also defined on RDDs. This means that
you don't need to `collect()` the results into an array (which could
give you an OutOfMemoryError in case the RDD is really really large)
before printing them.
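
A minimal sketch (note that on a cluster the println output ends up in each
executor's stdout rather than on the driver console):

errlog.filter(line => line.contains("sed")).foreach(println)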

Personally, when I learn using a new library, I like to look at its
Scaladoc 
(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
for Spark) and test it in the REPL/worksheets (for Spark you already
have `spark-shell`)

best,
--Jakob

On Wed, Feb 10, 2016 at 3:52 PM, Mich Talebzadeh
<mich.talebza...@cloudtechnologypartners.co.uk> wrote:
> Many thanks Jakob.
>
>
>
> So it basically boils down to this demarcation  as suggested which looks
> clearer
>
> val errlog = sc.textFile("/unix_files/*.ksh")
> errlog.filter(line => line.contains("sed")).collect().foreach(line =>
> println(line))
>
> Regards,
>
> Mich
>
> On 10/02/2016 23:21, Jakob Odersky wrote:
>
> Hi Mich,
> your assumptions 1 to 3 are all correct (nitpick: they're method
> *calls*, the methods being the part before the parentheses, but I
> assume that's what you meant). The last one is also a method call but
> uses syntactic sugar on top: `foreach(println)` boils down to
> `foreach(line => println(line))`.
>
> On an unrelated side-note, I would suggest you add a period between
> every method call, it makes things easier to read and is actually
> required in certain circumstances. Specifically I would add a period
> before collect() and foreach().
>
> best,
> --Jakob
>
> On Wed, Feb 10, 2016 at 2:35 PM, Mich Talebzadeh
> <mich.talebza...@cloudtechnologypartners.co.uk> wrote:
>
> Hi Chandeep Many thanks for your help In the line below errlog.filter(line
> => line.contains("sed"))collect()foreach(println) Can you please clarify the
> components with the correct naming as I am new to Scala errlog --> is the
> RDD? filter(line => line.contains("sed")) is a method collect() is another
> method ? foreach (println) ? Thanks On 10/02/2016 21:28, Chandeep Singh
> wrote: Hi Mich, If you would like to print everything to the console you
> could - errlog.filter(line => line.contains("sed"))collect()foreach(println)
> or you could always save to a file using any of the saveAs methods. Thanks,
> Chandeep On Wed, Feb 10, 2016 at 8:14 PM,
> <mich.talebza...@cloudtechnologypartners.co.uk> wrote:
>
> Hi, I have a bunch of files stored in hdfs /unit_files directory in total
> 319 files scala> val errlog = sc.textFile("/unix_files/*.ksh") scala>
> errlog.filter(line => line.contains("sed"))count() res104: Long = 1113 So it
> returns 1113 instances the word "sed" If I want to see the collection I can
> do scala> errlog.filter(line => line.contains("sed"))collect() res105:
> Array[String] = Array(" DSQUERY=${1} ; DBNAME=${2} ; ERROR=0 ;
> PROGNAME=$(basename $0 | sed -e s/.ksh//)", # . in environment based on
> argument for script., " exec sp_spaceused", " exec sp_spaceused",
> PROGNAME=$(basename $0 | sed -e s/.ksh//), " BACKUPSERVER=$5 # Server that
> is used to load the transaction dump", " BACKUPSERVER=$5 # Server that is
> used to load the transaction dump", " BACKUPSERVER=$5 # Server that is used
> to load the transaction dump", " cat $TMPDIR/${DBNAME}_trandump.sql | sed
> s/${DSQUERY}/${REMOTESERVER}/ > $TMPDIR/${DBNAME}_trandump.tmpsql", cat
> $TMPDIR/${DBNAME}_tran_transfer.sql | sed s/${DSQUERY}/${REMOTESERVER}/ >
> $TMPDIR/${DBNAME}_tran_transfer.tmpsql, PROGNAME=$(basename $0 | sed -e
> s/.ksh//), " B... scala> Now is there anyway I can retrieve all these
> instances or perhaps they are all wrapped up and I only see few lines?
> Thanks, Mich
>
> -- Dr Mich Talebzadeh LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com NOTE: The information in this email is
> proprietary and confidential. This message is for the designated recipient
> only, if you are not the intended recipient, you should destroy it
> immediately. Any information in this message shall not be understood as
> given or endorsed by Cloud Technology Partners Ltd, its subsidiaries or
> their employees, unless expressly so stated. It is the responsibility of the
> recipient to ensure that this email is virus free, therefore neither Cloud
> Technology partners Ltd, its subsidiaries nor their employees accept any
> responsibility.
>
>
>
>
>
> --
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://t

Re: How to collect/take arbitrary number of records in the driver?

2016-02-10 Thread Jakob Odersky
Another alternative:

rdd.take(1000).drop(100) //this also preserves ordering

Note however that this can lead to an OOM if the data you're taking is
too large. If you want to perform some operation sequentially on your
driver and don't care about performance, you could do something
similar as Mohammed suggested:

val filteredRDD = //same as previous post

filteredRDD.foreach{ elem =>
  // do something with elem, e.g. save to database
}
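
For illustration, a minimal sketch combining the take/drop idea with the
zipWithIndex/filter approach from the quoted post below (assuming `rdd` is the
RDD in question; zipWithIndex follows partition order, not a total ordering):

val slice = rdd.zipWithIndex()
  .filter { case (_, index) => index >= 100 && index < 1000 }
  .map { case (element, _) => element }
  .collect()                 // brings at most 900 records to the driver

slice.foreach { elem =>
  // process elem sequentially on the driver, e.g. save to a database
}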



On Tue, Feb 9, 2016 at 2:56 PM, Mohammed Guller  wrote:
> You can do something like this:
>
>
>
> val indexedRDD = rdd.zipWithIndex
>
> val filteredRDD = indexedRDD.filter{case(element, index) => (index >= 99) &&
> (index < 199)}
>
> val result = filteredRDD.take(100)
>
>
>
> Warning: the ordering of the elements in the RDD is not guaranteed.
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
>
>
>
> -Original Message-
> From: SRK [mailto:swethakasire...@gmail.com]
> Sent: Tuesday, February 9, 2016 1:58 PM
> To: user@spark.apache.org
> Subject: How to collect/take arbitrary number of records in the driver?
>
>
>
> Hi ,
>
>
>
> How to get a fixed amount of records from an RDD in Driver? Suppose I want
> the records from 100 to 1000 and then save them to some external database, I
> know that I can do it from Workers in partition but I want to avoid that for
> some reasons. The idea is to collect the data to driver and save, although
> slowly.
>
>
>
> I am looking for something like take(100, 1000)  or take (1000,2000)
>
>
>
> Thanks,
>
> Swetha
>
>
>
>
>
>
>
> --
>
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-collect-take-arbitrary-number-of-records-in-the-driver-tp26184.html
>
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>
> -
>
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
> commands, e-mail: user-h...@spark.apache.org
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.5.2 memory error

2016-02-02 Thread Jakob Odersky
Can you share some code that produces the error? It is probably not
due to spark but rather the way data is handled in the user code.
Does your code call any reduceByKey actions? These are often a source
for OOM errors.

On Tue, Feb 2, 2016 at 1:22 PM, Stefan Panayotov  wrote:
> Hi Guys,
>
> I need help with Spark memory errors when executing ML pipelines.
> The error that I see is:
>
>
> 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 32.0 in
> stage 32.0 (TID 3298)
>
>
> 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 12.0 in
> stage 32.0 (TID 3278)
>
>
> 16/02/02 20:34:39 INFO MemoryStore: ensureFreeSpace(2004728720) called with
> curMem=296303415, maxMem=8890959790
>
>
> 16/02/02 20:34:39 INFO MemoryStore: Block taskresult_3298 stored as bytes in
> memory (estimated size 1911.9 MB, free 6.1 GB)
>
>
> 16/02/02 20:34:39 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15:
> SIGTERM
>
>
> 16/02/02 20:34:39 ERROR Executor: Exception in task 12.0 in stage 32.0 (TID
> 3278)
>
>
> java.lang.OutOfMemoryError: Java heap space
>
>
>at java.util.Arrays.copyOf(Arrays.java:2271)
>
>
>at
> java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
>
>
>at
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
>
>
>at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>
>
>at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>
>at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>
>at java.lang.Thread.run(Thread.java:745)
>
>
> 16/02/02 20:34:39 INFO DiskBlockManager: Shutdown hook called
>
>
> 16/02/02 20:34:39 INFO Executor: Finished task 32.0 in stage 32.0 (TID
> 3298). 2004728720 bytes result sent via BlockManager)
>
>
> 16/02/02 20:34:39 ERROR SparkUncaughtExceptionHandler: Uncaught exception in
> thread Thread[Executor task launch worker-8,5,main]
>
>
> java.lang.OutOfMemoryError: Java heap space
>
>
>at java.util.Arrays.copyOf(Arrays.java:2271)
>
>
>at
> java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
>
>
>at
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
>
>
>at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>
>
>at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>
>at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>
>at java.lang.Thread.run(Thread.java:745)
>
>
> 16/02/02 20:34:39 INFO ShutdownHookManager: Shutdown hook called
>
>
> 16/02/02 20:34:39 INFO MetricsSystemImpl: Stopping azure-file-system metrics
> system...
>
>
> 16/02/02 20:34:39 INFO MetricsSinkAdapter: azurefs2 thread interrupted.
>
>
> 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system
> stopped.
>
>
> 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system
> shutdown complete.
>
>
>
>
>
> And …..
>
>
>
>
>
> 16/02/02 20:09:03 INFO impl.ContainerManagementProtocolProxy: Opening proxy
> : 10.0.0.5:30050
>
>
> 16/02/02 20:33:51 INFO yarn.YarnAllocator: Completed container
> container_1454421662639_0011_01_05 (state: COMPLETE, exit status: -104)
>
>
> 16/02/02 20:33:51 WARN yarn.YarnAllocator: Container killed by YARN for
> exceeding memory limits. 16.8 GB of 16.5 GB physical memory used. Consider
> boosting spark.yarn.executor.memoryOverhead.
>
>
> 16/02/02 20:33:56 INFO yarn.YarnAllocator: Will request 1 executor
> containers, each with 2 cores and 16768 MB memory including 384 MB overhead
>
>
> 16/02/02 20:33:56 INFO yarn.YarnAllocator: Container request (host: Any,
> capability: )
>
>
> 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching container
> container_1454421662639_0011_01_37 for on host 10.0.0.8
>
>
> 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching ExecutorRunnable.
> driverUrl:
> akka.tcp://sparkDriver@10.0.0.15:47446/user/CoarseGrainedScheduler,
> executorHostname: 10.0.0.8
>
>
> 16/02/02 20:33:57 INFO yarn.YarnAllocator: Received 1 containers from YARN,
> launching executors on 1 of them.
>
>
> I'll really appreciate any help here.
>
> Thank you,
>
> Stefan Panayotov, PhD
> Home: 610-355-0919
> Cell: 610-517-5586
> email: spanayo...@msn.com
> spanayo...@outlook.com
> spanayo...@comcast.net
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jakob Odersky
To address one specific question:

> Docs say it uses sun.misc.Unsafe to convert the physical RDD structure into
a byte array at some point for optimized GC and memory. My question is why is
it only applicable to SQL/DataFrames and not RDDs? RDDs have types too!

A principal difference between RDDs and DataFrames/Datasets is that the
latter have a schema associated with them. This means that they support only
certain types (primitives, case classes and more) and that they are
uniform, whereas RDDs can contain any serializable object and need not be
uniform. These properties make it possible to generate very efficient
serialization and other optimizations that cannot be achieved with plain
RDDs.
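
A minimal sketch of the difference, assuming a spark-shell style environment
with `sc` and `sqlContext` available:

case class Click(user: String, timestamp: Long)
val clicks = sc.parallelize(Seq(Click("ann", 1L), Click("bob", 2L)))

// The DataFrame knows its column names and types, so Catalyst/Tungsten can
// plan over them and use a compact binary representation:
val df = sqlContext.createDataFrame(clicks)
df.printSchema()

// The plain RDD only knows it holds serializable Click objects; Spark cannot
// look inside them, so no column-level optimization is possible:
clicks.map(_.user).count()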


Re: Spark 2.0.0 release plan

2016-01-29 Thread Jakob Odersky
I'm not an authoritative source but I think it is indeed the plan to
move the default build to 2.11.

See this discussion for more detail
http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html

On Fri, Jan 29, 2016 at 11:43 AM, Deenar Toraskar
 wrote:
> A related question. Are the plans to move the default Spark builds to Scala
> 2.11 with Spark 2.0?
>
> Regards
> Deenar
>
> On 27 January 2016 at 19:55, Michael Armbrust 
> wrote:
>>
>> We do maintenance releases on demand when there is enough to justify doing
>> one.  I'm hoping to cut 1.6.1 soon, but have not had time yet.
>>
>> On Wed, Jan 27, 2016 at 8:12 AM, Daniel Siegmann
>>  wrote:
>>>
>>> Will there continue to be monthly releases on the 1.6.x branch during the
>>> additional time for bug fixes and such?
>>>
>>> On Tue, Jan 26, 2016 at 11:28 PM, Koert Kuipers 
>>> wrote:

 thanks thats all i needed

 On Tue, Jan 26, 2016 at 6:19 PM, Sean Owen  wrote:
>
> I think it will come significantly later -- or else we'd be at code
> freeze for 2.x in a few days. I haven't heard anyone discuss this
> officially but had batted around May or so instead informally in
> conversation. Does anyone have a particularly strong opinion on that?
> That's basically an extra 3 month period.
>
> https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage
>
> On Tue, Jan 26, 2016 at 10:00 PM, Koert Kuipers 
> wrote:
> > Is the idea that spark 2.0 comes out roughly 3 months after 1.6? So
> > quarterly release as usual?
> > Thanks


>>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to debug ClassCastException: java.lang.String cannot be cast to java.lang.Long in SparkSQL

2016-01-27 Thread Jakob Odersky
> the data type mapping has been taken care of in my code,
could you share this?

On Tue, Jan 26, 2016 at 8:30 PM, Anfernee Xu  wrote:
> Hi,
>
> I'm using Spark 1.5.0, I wrote a custom Hadoop InputFormat to load data from
> 3rdparty datasource, the data type mapping has been taken care of in my
> code, but when I issued below query,
>
> SELECT * FROM ( SELECT count(*) as failures from test WHERE state !=
> 'success' ) as tmp WHERE  ( COALESCE(failures , 0) > 10   )
>
> I got such exception, it seems some column has invalid data, but from the
> exception I cannot tell which one is culprit. Any suggestion is appreciated.
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0
> (TID 3, slc00tuw): java.lang.ClassCastException: java.lang.String cannot be
> cast to java.lang.Long
> at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110)
> at
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getLong(rows.scala:41)
> at
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getLong(rows.scala:220)
> at
> org.apache.spark.sql.columnar.LongColumnStats.gatherStats(ColumnStats.scala:161)
> at
> org.apache.spark.sql.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:56)
> at
> org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:87)
> at
> org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78)
> at
> org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:87)
> at
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:142)
> at
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
> at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1280)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1268)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1267)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1493)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1455)
> at
> 

Re: Escaping tabs and newlines not working

2016-01-27 Thread Jakob Odersky
Can you provide some code that reproduces the issue, specifically in a Spark
job? The linked stackoverflow question is related to plain Scala and the
proposed answers offer a solution.

On Wed, Jan 27, 2016 at 1:57 PM, Harshvardhan Chauhan 
wrote:

>
>
> Hi,
>
> Escaping newline and tab doesn't seem to work for me. Spark version 1.5.2
> on EMR, reading files from S3.
>
> here is more details about my issue
>
> Scala escaping newline and tab characters
> I am trying to use the following code to get rid of tab and newline
> characters in the url but I still get newline and…
> stackoverflow.com
>
>
> *Harshvardhan Chauhan*  |  Software Engineer
> *GumGum*   |  *Ads that stick*
> 310-260-9666  |  ha...@gumgum.com
>


Re: Maintain state outside rdd

2016-01-27 Thread Jakob Odersky
Be careful with mapPartitions though: since it is executed on worker
nodes, you may not see side-effects locally.

Is it not possible to represent your state changes as part of your
rdd's transformations? I.e. return a tuple containing the modified
data and some accumulated state.
If that really doesn't work, I would second accumulators. Check out
http://spark.apache.org/docs/latest/programming-guide.html#accumulators,
it also tells you how to define your own for custom data types.
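
For illustration, a minimal sketch of the accumulator approach (assuming `sc`
is an existing SparkContext; note that updates made inside a transformation
may be re-applied if a task is retried):

val processed = sc.accumulator(0, "processed rows")
val input = sc.parallelize(1 to 100)

val result = input.map { x =>
  processed += 1           // state tracked outside the RDD
  x * 2
}
result.count()             // an action forces evaluation
println(processed.value)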

On Wed, Jan 27, 2016 at 7:22 PM, Krishna  wrote:
> mapPartitions(...) seems like a good candidate, since, it's processing over
> a partition while maintaining state across map(...) calls.
>
> On Wed, Jan 27, 2016 at 6:58 PM, Ted Yu  wrote:
>>
>> Initially I thought of using accumulators.
>>
>> Since state change can be anything, how about storing state in external
>> NoSQL store such as hbase ?
>>
>> On Wed, Jan 27, 2016 at 6:37 PM, Krishna  wrote:
>>>
>>> Thanks; What I'm looking for is a way to see changes to the state of some
>>> variable during map(..) phase.
>>> I simplified the scenario in my example by making row_index() increment
>>> "incr" by 1 but in reality, the change to "incr" can be anything.
>>>
>>> On Wed, Jan 27, 2016 at 6:25 PM, Ted Yu  wrote:

 Have you looked at this method ?

* Zips this RDD with its element indices. The ordering is first based
 on the partition index
 ...
   def zipWithIndex(): RDD[(T, Long)] = withScope {

 On Wed, Jan 27, 2016 at 6:03 PM, Krishna  wrote:
>
> Hi,
>
> I've a scenario where I need to maintain state that is local to a
> worker that can change during map operation. What's the best way to handle
> this?
>
> incr = 0
> def row_index():
>   global incr
>   incr += 1
>   return incr
>
> out_rdd = inp_rdd.map(lambda x: row_index()).collect()
>
> "out_rdd" in this case only contains 1s but I would like it to have
> index of each row in "inp_rdd".


>>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Python UDFs

2016-01-27 Thread Jakob Odersky
Have you checked:

- the mllib doc for python
https://spark.apache.org/docs/1.6.0/api/python/pyspark.mllib.html#pyspark.mllib.linalg.DenseVector
- the udf doc 
https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.functions.udf

You should be fine in returning a DenseVector as the return type of
the udf, as it provides access to a schema.

These are just directions to explore, I haven't used PySpark myself.

On Wed, Jan 27, 2016 at 10:38 AM, Stefan Panayotov  wrote:
> Hi,
>
> I have defined a UDF in Scala like this:
>
> import org.apache.spark.mllib.linalg.Vector
> import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary,
> Statistics}
> import org.apache.spark.mllib.linalg.DenseVector
>
> val determineVector = udf((a: Double, b: Double) => {
> val data: Array[Double] = Array(a,b)
> val dv = new DenseVector(data)
> dv
>   })
>
> How can I write the corresponding function in Pyhton/Pyspark?
>
> Thanks for your help
>
> Stefan Panayotov, PhD
> Home: 610-355-0919
> Cell: 610-517-5586
> email: spanayo...@msn.com
> spanayo...@outlook.com
> spanayo...@comcast.net
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Using Spark in mixed Java/Scala project

2016-01-27 Thread Jakob Odersky
JavaSparkContext has a wrapper constructor for the "scala"
SparkContext. In this case all you need to do is declare a
SparkContext that is accessible both from the Java and Scala sides of
your project and wrap the context with a JavaSparkContext.

Search for Java source compatibility with Scala for more information on
how to interface Java with Scala (the other way around is trivial).
Essentially, as long as you declare your SparkContext either in Java
or as a val/var/def in a plain Scala class you are good.
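
A minimal sketch, assuming the context is created on the Scala side and then
handed to the Java side:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.api.java.JavaSparkContext

val conf = new SparkConf().setAppName("mixed-java-scala")
val sc   = new SparkContext(conf)     // used directly from Scala
val jsc  = new JavaSparkContext(sc)   // wrapper passed to the Java code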

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: trouble using eclipse to view spark source code

2016-01-18 Thread Jakob Odersky
Have you followed the guide on how to import spark into eclipse
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse
?

On 18 January 2016 at 13:04, Andy Davidson 
wrote:

> Hi
>
> My project is implemented using Java 8 and Python. Sometimes it's handy to
> look at the spark source code. For some unknown reason, if I open a spark project
> my java projects show tons of compiler errors. I think it may have
> something to do with Scala. If I close the projects my java code is fine.
>
> I typically I only want to import the machine learning and streaming
> projects.
>
> I am not sure if this is an issue or not but my java projects are built
> using Gradle
>
> In eclipse preferences -> scala -> installations I selected Scala: 2.10.6
> (built in)
>
> Any suggestions would be greatly appreciate
>
> Andy
>
>
>
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: simultaneous actions

2016-01-15 Thread Jakob Odersky
I don't think RDDs are threadsafe.
More fundamentally however, why would you want to run RDD actions in
parallel? The idea behind RDDs is to provide you with an abstraction for
computing parallel operations on distributed data. Even if you were to call
actions from several threads at once, the individual executors of your
spark environment would still have to perform operations sequentially.

As an alternative, I would suggest to restructure your RDD transformations
to compute the required results in one single operation.

On 15 January 2016 at 06:18, Jonathan Coveney  wrote:

> Threads
>
>
> El viernes, 15 de enero de 2016, Kira  escribió:
>
>> Hi,
>>
>> Can we run *simultaneous* actions on the *same RDD* ?; if yes how can this
>> be done ?
>>
>> Thank you,
>> Regards
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: simultaneous actions

2016-01-15 Thread Jakob Odersky
I stand corrected. How considerable are the benefits though? Will the
scheduler be able to dispatch jobs from both actions simultaneously (or on
a when-workers-become-available basis)?
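
For reference, a minimal sketch of submitting two actions on a cached RDD from
separate threads (assuming `sc` is an existing SparkContext), along the lines
of what Koert and Matei describe below:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val data = sc.parallelize(1 to 1000000).cache()

val countF = Future { data.count() }
val sumF   = Future { data.map(_.toLong).reduce(_ + _) }

println(Await.result(countF, 10.minutes))
println(Await.result(sumF, 10.minutes))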

On 15 January 2016 at 11:44, Koert Kuipers <ko...@tresata.com> wrote:

> we run multiple actions on the same (cached) rdd all the time, i guess in
> different threads indeed (its in akka)
>
> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>
>> RDDs actually are thread-safe, and quite a few applications use them this
>> way, e.g. the JDBC server.
>>
>> Matei
>>
>> On Jan 15, 2016, at 2:10 PM, Jakob Odersky <joder...@gmail.com> wrote:
>>
>> I don't think RDDs are threadsafe.
>> More fundamentally however, why would you want to run RDD actions in
>> parallel? The idea behind RDDs is to provide you with an abstraction for
>> computing parallel operations on distributed data. Even if you were to call
>> actions from several threads at once, the individual executors of your
>> spark environment would still have to perform operations sequentially.
>>
>> As an alternative, I would suggest to restructure your RDD
>> transformations to compute the required results in one single operation.
>>
>> On 15 January 2016 at 06:18, Jonathan Coveney <jcove...@gmail.com> wrote:
>>
>>> Threads
>>>
>>>
>>> El viernes, 15 de enero de 2016, Kira <mennou...@gmail.com> escribió:
>>>
>>>> Hi,
>>>>
>>>> Can we run *simultaneous* actions on the *same RDD* ?; if yes how can
>>>> this
>>>> be done ?
>>>>
>>>> Thank you,
>>>> Regards
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com
>>>> <http://nabble.com>.
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>
>>
>


Re: Why is this job running since one hour?

2016-01-06 Thread Jakob Odersky
What is the job doing? How much data are you processing?

On 6 January 2016 at 10:33, unk1102  wrote:

> Hi I have one main Spark job which spawns multiple child spark jobs. One of
> the child spark job is running for an hour and it keeps on hanging there I
> have taken snap shot please see
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n25899/Screen_Shot_2016-01-06_at_11.jpg
> >
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-this-job-running-since-one-hour-tp25899.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: What should be the ideal value(unit) for spark.memory.offheap.size

2016-01-06 Thread Jakob Odersky
Check the configuration guide for a description on units (
http://spark.apache.org/docs/latest/configuration.html#spark-properties).
In your case, 5GB would be specified as 5g.

On 6 January 2016 at 10:29, unk1102  wrote:

> Hi As part of Spark 1.6 release what should be ideal value or unit for
> spark.memory.offheap.size I have set as 5000 I assume it will be 5GB is it
> correct? Please guide.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/What-should-be-the-ideal-value-unit-for-spark-memory-offheap-size-tp25898.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2015-12-17 Thread Jakob Odersky
It might be a good idea to see how many files are open and try increasing
the open file limit (this is done on an os level). In some application
use-cases it is actually a legitimate need.

If that doesn't help, make sure you close any unused files and streams in
your code. It will also be easier to help diagnose the issue if you send an
error-reproducing snippet.


Re: File not found error running query in spark-shell

2015-12-16 Thread Jakob Odersky
When you re-run the last statement a second time, does it work? Could it be
related to https://issues.apache.org/jira/browse/SPARK-12350 ?

On 16 December 2015 at 10:39, Ted Yu  wrote:

> Hi,
> I used the following command on a recently refreshed checkout of master
> branch:
>
> ~/apache-maven-3.3.3/bin/mvn -Phive -Phive-thriftserver -Pyarn
> -Phadoop-2.4 -Dhadoop.version=2.7.0 package -DskipTests
>
> I was then running simple query in spark-shell:
> Seq(
>   (83, 0, 38),
>   (26, 0, 79),
>   (43, 81, 24)
> ).toDF("a", "b", "c").registerTempTable("cachedData")
>
> sqlContext.cacheTable("cachedData")
> sqlContext.sql("select * from cachedData").show
>
> However, I encountered errors in the following form:
>
> http://pastebin.com/QeiwJpwi
>
> Under workspace, I found:
>
> ./sql/catalyst/target/scala-2.10/classes/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedClass.class
>
> but no ByteOrder.class.
>
> Did I miss some step(s) ?
>
> Thanks
>


Re: File not found error running query in spark-shell

2015-12-16 Thread Jakob Odersky
Yeah, the same kind of error actually happens in the JIRA. It actually
succeeds but a load of exceptions are thrown. Subsequent runs don't produce
any errors anymore.

On 16 December 2015 at 10:55, Ted Yu <yuzhih...@gmail.com> wrote:

> The first run actually worked. It was the amount of exceptions preceding
> the result that surprised me.
>
> I want to see if there is a way of getting rid of the exceptions.
>
> Thanks
>
> On Wed, Dec 16, 2015 at 10:53 AM, Jakob Odersky <joder...@gmail.com>
> wrote:
>
>> When you re-run the last statement a second time, does it work? Could it
>> be related to https://issues.apache.org/jira/browse/SPARK-12350 ?
>>
>> On 16 December 2015 at 10:39, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Hi,
>>> I used the following command on a recently refreshed checkout of master
>>> branch:
>>>
>>> ~/apache-maven-3.3.3/bin/mvn -Phive -Phive-thriftserver -Pyarn
>>> -Phadoop-2.4 -Dhadoop.version=2.7.0 package -DskipTests
>>>
>>> I was then running simple query in spark-shell:
>>> Seq(
>>>   (83, 0, 38),
>>>   (26, 0, 79),
>>>   (43, 81, 24)
>>> ).toDF("a", "b", "c").registerTempTable("cachedData")
>>>
>>> sqlContext.cacheTable("cachedData")
>>> sqlContext.sql("select * from cachedData").show
>>>
>>> However, I encountered errors in the following form:
>>>
>>> http://pastebin.com/QeiwJpwi
>>>
>>> Under workspace, I found:
>>>
>>> ./sql/catalyst/target/scala-2.10/classes/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedClass.class
>>>
>>> but no ByteOrder.class.
>>>
>>> Did I miss some step(s) ?
>>>
>>> Thanks
>>>
>>
>>
>


Re: File not found error running query in spark-shell

2015-12-16 Thread Jakob Odersky
For future reference, this should be fixed with PR #10337 (
https://github.com/apache/spark/pull/10337)

On 16 December 2015 at 11:01, Jakob Odersky <joder...@gmail.com> wrote:

> Yeah, the same kind of error actually happens in the JIRA. It actually
> succeeds but a load of exceptions are thrown. Subsequent runs don't produce
> any errors anymore.
>
> On 16 December 2015 at 10:55, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> The first run actually worked. It was the amount of exceptions preceding
>> the result that surprised me.
>>
>> I want to see if there is a way of getting rid of the exceptions.
>>
>> Thanks
>>
>> On Wed, Dec 16, 2015 at 10:53 AM, Jakob Odersky <joder...@gmail.com>
>> wrote:
>>
>>> When you re-run the last statement a second time, does it work? Could it
>>> be related to https://issues.apache.org/jira/browse/SPARK-12350 ?
>>>
>>> On 16 December 2015 at 10:39, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> I used the following command on a recently refreshed checkout of master
>>>> branch:
>>>>
>>>> ~/apache-maven-3.3.3/bin/mvn -Phive -Phive-thriftserver -Pyarn
>>>> -Phadoop-2.4 -Dhadoop.version=2.7.0 package -DskipTests
>>>>
>>>> I was then running simple query in spark-shell:
>>>> Seq(
>>>>   (83, 0, 38),
>>>>   (26, 0, 79),
>>>>   (43, 81, 24)
>>>> ).toDF("a", "b", "c").registerTempTable("cachedData")
>>>>
>>>> sqlContext.cacheTable("cachedData")
>>>> sqlContext.sql("select * from cachedData").show
>>>>
>>>> However, I encountered errors in the following form:
>>>>
>>>> http://pastebin.com/QeiwJpwi
>>>>
>>>> Under workspace, I found:
>>>>
>>>> ./sql/catalyst/target/scala-2.10/classes/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedClass.class
>>>>
>>>> but no ByteOrder.class.
>>>>
>>>> Did I miss some step(s) ?
>>>>
>>>> Thanks
>>>>
>>>
>>>
>>
>


Re: ideal number of executors per machine

2015-12-15 Thread Jakob Odersky
Hi Veljko,
I would assume keeping the number of executors per machine to a minimum is
best for performance (as long as you consider memory requirements as well).
Each executor is a process that can run tasks in multiple threads. On a
kernel/hardware level, thread switches are much cheaper than process
switches, and therefore having a single executor with multiple threads gives
better over-all performance than multiple executors with fewer threads.

--Jakob

On 15 December 2015 at 13:07, Veljko Skarich 
wrote:

> Hi,
>
> I'm looking for suggestions on the ideal number of executors per machine.
> I run my jobs on 64G 32 core machines, and at the moment I have one
> executor running per machine, on the spark standalone cluster.
>
>  I could not find many guidelines for figuring out the ideal number of
> executors; the Spark official documentation merely recommends not having
> more than 64G per executor to avoid GC issues. Anyone have and advice on
> this?
>
> thank you.
>


Re: what are the cons/drawbacks of a Spark DataFrames

2015-12-15 Thread Jakob Odersky
With DataFrames you lose type-safety. Depending on the language you are
using, this can also be considered a drawback.
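
A minimal sketch of the difference, assuming `sc` and `sqlContext` are
available (e.g. in spark-shell):

case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))

// With the RDD API, a misspelled field is caught at compile time:
// people.map(_.agee)             // does not compile

// With the DataFrame API, the same mistake only surfaces at runtime:
val df = sqlContext.createDataFrame(people)
// df.select("agee")              // throws an AnalysisException when executed
df.select("age").show()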

On 15 December 2015 at 15:08, Jakob Odersky <joder...@gmail.com> wrote:

> By using DataFrames you will not need to specify RDD operations explicitly;
> instead the operations are built and optimized using the information
> available in the DataFrame's schema.
> The only draw-back I can think of is some loss of generality: given a
> dataframe containing types A, you will not be able to include types B even
> if B is a sub-type of A. However, in real use-cases I have never run into this
> problem.
>
> I once had a related question on RDDs and DataFrames, here is the answer I
> got from Michael Armbrust:
>
> Here is how I view the relationship between the various components of
>> Spark:
>>
>>  - *RDDs - *a low level API for expressing DAGs that will be executed in
>> parallel by Spark workers
>>  - *Catalyst -* an internal library for expressing trees that we use to
>> build relational algebra and expression evaluation.  There's also an
>> optimizer and query planner than turns these into logical concepts into RDD
>> actions.
>>  - *Tungsten -* an internal optimized execution engine that can compile
>> catalyst expressions into efficient java bytecode that operates directly on
>> serialized binary data.  It also has nice low level data structures /
>> algorithms like hash tables and sorting that operate directly on serialized
>> data.  These are used by the physical nodes that are produced by the
>> query planner (and run inside of RDD operation on workers).
>>  - *DataFrames - *a user facing API that is similar to SQL/LINQ for
>> constructing dataflows that are backed by catalyst logical plans
>>  - *Datasets* - a user facing API that is similar to the RDD API for
>> constructing dataflows that are backed by catalyst logical plans
>>
>> So everything is still operating on RDDs but I anticipate most users will
>> eventually migrate to the higher level APIs for convenience and automatic
>> optimization
>>
>
> Hope that also helps you get an idea of the different concepts and their
> potential advantages/drawbacks.
>


Re: what are the cons/drawbacks of a Spark DataFrames

2015-12-15 Thread Jakob Odersky
By using DataFrames you will not need to specify RDD operations explicitly;
instead the operations are built and optimized using the information
available in the DataFrame's schema.
The only draw-back I can think of is some loss of generality: given a
dataframe containing types A, you will not be able to include types B even if
B is a sub-type of A. However, in real use-cases I have never run into this
problem.

I once had a related question on RDDs and DataFrames, here is the answer I
got from Michael Armbrust:

Here is how I view the relationship between the various components of Spark:
>
>  - *RDDs - *a low level API for expressing DAGs that will be executed in
> parallel by Spark workers
>  - *Catalyst -* an internal library for expressing trees that we use to
> build relational algebra and expression evaluation.  There's also an
> optimizer and query planner than turns these into logical concepts into RDD
> actions.
>  - *Tungsten -* an internal optimized execution engine that can compile
> catalyst expressions into efficient java bytecode that operates directly on
> serialized binary data.  It also has nice low level data structures /
> algorithms like hash tables and sorting that operate directly on serialized
> data.  These are used by the physical nodes that are produced by the
> query planner (and run inside of RDD operation on workers).
>  - *DataFrames - *a user facing API that is similar to SQL/LINQ for
> constructing dataflows that are backed by catalyst logical plans
>  - *Datasets* - a user facing API that is similar to the RDD API for
> constructing dataflows that are backed by catalyst logical plans
>
> So everything is still operating on RDDs but I anticipate most users will
> eventually migrate to the higher level APIs for convenience and automatic
> optimization
>

Hope that also helps you get an idea of the different concepts and their
potential advantages/drawbacks.


Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-14 Thread Jakob Odersky
> Sorry,I'm late.I try again and again ,now I use spark 1.4.0 ,hadoop
2.4.1.but I also find something strange like this :

>
http://apache-spark-user-list.1001560.n3.nabble.com/worker-java-lang-ClassNotFoundException-ttt-test-anonfun-1-td25696.html
> (if i use "textFile",It can't run.)

In the link you sent, there is still an `addJar(spark-assembly-hadoop-xx)`,
can you try running your application with that?

On 11 December 2015 at 03:08, Bonsen  wrote:

> Thank you,and I find the problem is my package is test,but I write package
> org.apache.spark.examples ,and IDEA had imported the
> spark-examples-1.5.2-hadoop2.6.0.jar ,so I can run it,and it makes lots of
> problems
> __
> Now , I change the package like this:
>
> package test
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> object test {
>   def main(args: Array[String]) {
> val conf = new
> SparkConf().setAppName("mytest").setMaster("spark://Master:7077")
> val sc = new SparkContext(conf)
> sc.addJar("/home/hadoop/spark-assembly-1.5.2-hadoop2.6.0.jar")//It
> doesn't work.!?
> val rawData = sc.textFile("/home/hadoop/123.csv")
> val secondData = rawData.flatMap(_.split(",").toString)
> println(secondData.first)   /line 32
> sc.stop()
>   }
> }
> it causes that:
> 15/12/11 18:41:06 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> 219.216.65.129): java.lang.ClassNotFoundException: test.test$$anonfun$1
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
> 
> 
> //  219.216.65.129 is my worker computer.
> //  I can connect to my worker computer.
> // Spark can start successfully.
> //  addFile is also doesn't work,the tmp file will also dismiss.
>
>
>
>
>
>
> At 2015-12-10 22:32:21, "Himanshu Mehra [via Apache Spark User List]" <[hidden
> email] > wrote:
>
> You are trying to print an array, but anyway it will print the objectID
>  of the array if the input is same as you have shown here. Try flatMap()
> instead of map and check if the problem is same.
>
>--Himanshu
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/HELP-I-get-java-lang-String-cannot-be-cast-to-java-lang-Intege-for-a-long-time-tp25666p25667.html
> To unsubscribe from HELP! I get "java.lang.String cannot be cast to
> java.lang.Intege " for a long time., click here.
> NAML
> 
>
>
>
>
>
> --
> View this message in context: Re:Re: HELP! I get "java.lang.String cannot
> be cast to java.lang.Intege " for a long time.
> 
>
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-14 Thread Jakob Odersky
sorry typo, I meant *without* the addJar

On 14 December 2015 at 11:13, Jakob Odersky <joder...@gmail.com> wrote:

> > Sorry,I'm late.I try again and again ,now I use spark 1.4.0 ,hadoop
> 2.4.1.but I also find something strange like this :
>
> >
> http://apache-spark-user-list.1001560.n3.nabble.com/worker-java-lang-ClassNotFoundException-ttt-test-anonfun-1-td25696.html
> > (if i use "textFile",It can't run.)
>
> In the link you sent, there is still an
> `addJar(spark-assembly-hadoop-xx)`, can you try running your application
> with that?
>
> On 11 December 2015 at 03:08, Bonsen <hengbohe...@126.com> wrote:
>
>> Thank you,and I find the problem is my package is test,but I write
>> package org.apache.spark.examples ,and IDEA had imported the
>> spark-examples-1.5.2-hadoop2.6.0.jar ,so I can run it,and it makes lots of
>> problems
>> __
>> Now , I change the package like this:
>>
>> package test
>> import org.apache.spark.SparkConf
>> import org.apache.spark.SparkContext
>> object test {
>>   def main(args: Array[String]) {
>> val conf = new
>> SparkConf().setAppName("mytest").setMaster("spark://Master:7077")
>> val sc = new SparkContext(conf)
>> sc.addJar("/home/hadoop/spark-assembly-1.5.2-hadoop2.6.0.jar")//It
>> doesn't work.!?
>> val rawData = sc.textFile("/home/hadoop/123.csv")
>> val secondData = rawData.flatMap(_.split(",").toString)
>> println(secondData.first)   /line 32
>> sc.stop()
>>   }
>> }
>> it causes that:
>> 15/12/11 18:41:06 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
>> 219.216.65.129): java.lang.ClassNotFoundException: test.test$$anonfun$1
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:348)
>> at
>> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
>> 
>> 
>> //  219.216.65.129 is my worker computer.
>> //  I can connect to my worker computer.
>> // Spark can start successfully.
>> // addFile also doesn't work; the temporary file also disappears.
>>
>>
>>
>>
>>
>>
>> At 2015-12-10 22:32:21, "Himanshu Mehra [via Apache Spark User List]" 
>> <[hidden email]> wrote:
>>
>> You are trying to print an array, but anyway it will print the objectID
>>  of the array if the input is same as you have shown here. Try flatMap()
>> instead of map and check if the problem is same.
>>
>>--Himanshu
>>
>>
>>
>>
>>
>>
>> --
>> View this message in context: Re:Re: HELP! I get "java.lang.String
>> cannot be cast to java.lang.Intege " for a long time.
>> <http://apache-spark-user-list.1001560.n3.nabble.com/HELP-I-get-java-lang-String-cannot-be-cast-to-java-lang-Intege-for-a-long-time-tp25666p25689.html>
>>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>
>
>


Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-11 Thread Jakob Odersky
It looks like you have an issue with your classpath, I think it is because
you add a jar containing Spark twice: first, you have a dependency on Spark
somewhere in your build tool (this allows you to compile and run your
application), second you re-add Spark here

>  sc.addJar("/home/hadoop/spark-assembly-1.5.2-hadoop2.6.0.jar")//It
doesn't work.!?

I recommend you remove that line and see if everything works.
If you have that line because you need hadoop 2.6, I recommend you build
spark against that version and publish locally with maven
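
For reference, an untested sketch of that last step, following the standard Spark 1.5.x build documentation (profile names may differ for other releases):

    # from the Spark source tree: build against Hadoop 2.6 and install into the local Maven repository
    build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean install

The application build can then depend on the locally installed Spark artifacts instead of re-adding the assembly jar at runtime with sc.addJar.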


Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-11 Thread Jakob Odersky
Btw, Spark 1.5 comes with support for hadoop 2.2 by default

On 11 December 2015 at 03:08, Bonsen  wrote:

> Thank you. I found the problem: my package is test, but I had written
> package org.apache.spark.examples, and IDEA had imported
> spark-examples-1.5.2-hadoop2.6.0.jar, so I could run it, and it caused lots
> of problems.
> __
> Now , I change the package like this:
>
> package test
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> object test {
>   def main(args: Array[String]) {
> val conf = new
> SparkConf().setAppName("mytest").setMaster("spark://Master:7077")
> val sc = new SparkContext(conf)
> sc.addJar("/home/hadoop/spark-assembly-1.5.2-hadoop2.6.0.jar")//It
> doesn't work.!?
> val rawData = sc.textFile("/home/hadoop/123.csv")
> val secondData = rawData.flatMap(_.split(",").toString)
> println(secondData.first)   // line 32
> sc.stop()
>   }
> }
> it causes that:
> 15/12/11 18:41:06 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> 219.216.65.129): java.lang.ClassNotFoundException: test.test$$anonfun$1
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
> 
> 
> //  219.216.65.129 is my worker computer.
> //  I can connect to my worker computer.
> // Spark can start successfully.
> // addFile also doesn't work; the temporary file also disappears.
>
>
>
>
>
>
> At 2015-12-10 22:32:21, "Himanshu Mehra [via Apache Spark User List]" <[hidden
> email] > wrote:
>
> You are trying to print an array, but anyway it will print the objectID
>  of the array if the input is same as you have shown here. Try flatMap()
> instead of map and check if the problem is same.
>
>--Himanshu
>
>
>
>
>
>
> --
> View this message in context: Re:Re: HELP! I get "java.lang.String cannot
> be cast to java.lang.Intege " for a long time.
> 
>
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-10 Thread Jakob Odersky
Could you provide some more context? What is rawData?

On 10 December 2015 at 06:38, Bonsen  wrote:

> I do it like this: "val secondData = rawData.flatMap(_.split("\t").take(3))"
>
> and I find:
> 15/12/10 22:36:55 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> 219.216.65.129): java.lang.ClassCastException: java.lang.String cannot be
> cast to java.lang.Integer
> at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
> at
> org.apache.spark.examples.SparkPi$$anonfun$1.apply(SparkPi.scala:32)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/HELP-I-get-java-lang-String-cannot-be-cast-to-java-lang-Intege-for-a-long-time-tp25666p25668.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
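
Two separate issues show up across these reports. The trace frames name org.apache.spark.examples.SparkPi$$anonfun$1 rather than the poster's own class, which fits the spark-examples jar / package mixup confirmed in the follow-up messages above. Separately, the variant quoted in those follow-ups, rawData.flatMap(_.split(",").toString), calls toString on an Array, so the resulting RDD holds the characters of a string like "[Ljava.lang.String;@1a2b3c" instead of the fields. A minimal, untested sketch of what was probably intended, assuming rawData is an RDD[String] of delimited lines:

    val fields = rawData.map(_.split(","))                    // RDD[Array[String]]
    val firstField = fields.map(_.headOption.getOrElse(""))   // first column of each line
    println(firstField.first())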


Re: Warning: Master endpoint spark://ip:7077 was not a REST server. Falling back to legacy submission gateway instead.

2015-12-10 Thread Jakob Odersky
Is there any other process using port 7077?
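
A quick way to check on the master host, assuming a Linux box with the usual tools installed:

    sudo lsof -i :7077
    # or
    sudo netstat -tlnp | grep 7077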

On 10 December 2015 at 08:52, Andy Davidson 
wrote:

> Hi
>
> I am using spark-1.5.1-bin-hadoop2.6. Any idea why I get this warning. My
> job seems to run with out any problem.
>
> Kind regards
>
> Andy
>
> + /root/spark/bin/spark-submit --class
> com.pws.spark.streaming.IngestDriver --master spark://
> ec2-54-205-209-122.us-west-1.compute.amazonaws.com:7077
> --total-executor-cores 2 --deploy-mode cluster
> hdfs:///home/ec2-user/build/ingest-all.jar --clusterMode --dirPath week_3
>
> Running Spark using the REST application submission protocol.
>
> 15/12/10 16:46:33 WARN RestSubmissionClient: Unable to connect to server
> spark://ec2-54-205-209-122.us-west-1.compute.amazonaws.com:7077.
>
> Warning: Master endpoint
> ec2-54-205-209-122.us-west-1.compute.amazonaws.com:7077 was not a REST
> server. Falling back to legacy submission gateway instead.
>


Re: StackOverflowError when writing dataframe to table

2015-12-10 Thread Jakob Odersky
Can you give us some more info about the dataframe and caching? Ideally a
set of steps to reproduce the issue
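
The repeating ObjectOutputStream / List$SerializationProxy frames in the trace below are typical of serializing a very deep object graph, most often a long lineage built up before the write. A hedged, untested sketch of one common mitigation, truncating the lineage with a checkpoint (names follow the report, which uses PySpark; the equivalent calls exist there, shown here in Scala):

    sc.setCheckpointDir("/tmp/spark-checkpoints")   // use a reliable directory (HDFS on a cluster)
    val rows = mydataframe.rdd                      // RDD[Row]
    rows.checkpoint()                               // cut the lineage at this point
    rows.count()                                    // run a job so the checkpoint is materialized
    val flat = sqlContext.createDataFrame(rows, mydataframe.schema)
    flat.write.saveAsTable("tablename")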


On 9 December 2015 at 14:59, apu mishra . rr  wrote:

> The command
>
> mydataframe.write.saveAsTable(name="tablename")
>
> sometimes results in java.lang.StackOverflowError (see below for fuller
> error message).
>
> This is after I am able to successfully run cache() and show() methods on
> mydataframe.
>
> The issue is not deterministic, i.e. the same code sometimes works fine,
> sometimes not.
>
> I am running PySpark with:
>
> spark-submit --master local[*] --driver-memory 24g --executor-memory 24g
>
> Any help understanding this issue would be appreciated!
>
> Thanks, Apu
>
> Fuller error message:
>
> Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError
>
> at
> java.io.ObjectOutputStream$HandleTable.assign(ObjectOutputStream.java:2281)
>
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1428)
>
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>
> at
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>
> at
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>
> at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:497)
>
> at
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>
> at
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>
> at
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>
> at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:497)
>
> at
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>
> at
> 

Re: Help with type check

2015-11-30 Thread Jakob Odersky
Hi Eyal,

what you're seeing is not a Spark issue, it is related to boxed types.

I assume 'b' in your code is some kind of java buffer, where b.getDouble()
returns an instance of java.lang.Double and not a scala.Double. Hence
muCouch is an Array[java.lang.Double], an array containing boxed doubles.

To fix your problem, change 'yield b.getDouble(i)' to 'yield
b.getDouble(i).doubleValue'

You might want to have a look at these too:
-
http://stackoverflow.com/questions/23821576/efficient-conversion-of-java-util-listjava-lang-double-to-scala-listdouble
- https://docs.oracle.com/javase/7/docs/api/java/lang/Double.html
- http://www.scala-lang.org/api/current/index.html#scala.Double
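
Concretely, an untested sketch of that change, keeping the structure of the snippet quoted below (input, e and b come from the original report; the MLlib DenseVector is assumed):

    import org.apache.spark.mllib.linalg.DenseVector

    val muCouch: Array[Double] = {
      val e = input.filter(_.id == "mu")(0).content()
      val b = e.getArray("feature_mean")
      for (i <- 0 to e.getInt("features")) yield b.getDouble(i).doubleValue  // unbox explicitly
    }.toArray

    val vec = new DenseVector(muCouch)   // now an Array[scala.Double], so this compiles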

On 30 November 2015 at 10:13, Eyal Sharon  wrote:

> Hi ,
>
> I have problem with inferring what are the types bug here
>
> I have this code fragment . it parse Json to Array[Double]
>
>
>
>
>
>
> val muCouch = {
>   val e = input.filter( _.id=="mu")(0).content()
>   val b = e.getArray("feature_mean")
>   for (i <- 0 to e.getInt("features")) yield b.getDouble(i)
> }.toArray
>
> Now the problem is when I want to create a dense vector  :
>
> new DenseVector(muCouch)
>
>
> I get the following error :
>
>
> Error:(111, 21) type mismatch;
>  found   : Array[java.lang.Double]
>  required: Array[scala.Double]
>
>
> Now, I can probably find a workaround for that, but I want to get a deeper
> understanding of why it occurs.
>
> p.s - I do use collection.JavaConversions._
>
> Thanks !
>
>
>
>
>


Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Jakob Odersky
Hi everyone,

I'm doing some reading-up on all the newer features of Spark such as
DataFrames, DataSets and Project Tungsten. This got me a bit confused on
the relation between all these concepts.

When starting to learn Spark, I read a book and the original paper on RDDs,
this led me to basically think "Spark == RDDs".
Now, looking into DataFrames, I read that they are basically (distributed)
collections with an associated schema, thus enabling declarative queries
and optimization (through Catalyst). I am uncertain how DataFrames relate
to RDDs: are DataFrames transformed to operations on RDDs once they have
been optimized? Or are they completely different concepts? In case of the
latter, do DataFrames still use the Spark scheduler and get broken down
into a DAG of stages and tasks?

Regarding project Tungsten, where does it fit in? To my understanding it is
used to efficiently cache data in memory and may also be used to generate
query code for specialized hardware. This sounds as though it would work on
Spark's worker nodes, however it would also only work with
schema-associated data (aka DataFrames), thus leading me to the conclusion
that RDDs and DataFrames do not share a common backend which in turn
contradicts my conception of "Spark == RDDs".

Maybe I missed the obvious as these questions seem pretty basic, however I
was unable to find clear answers in Spark documentation or related papers
and talks. I would greatly appreciate any clarifications.

thanks,
--Jakob


Re: Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Jakob Odersky
Thanks Michael, that helped me a lot!

On 23 November 2015 at 17:47, Michael Armbrust <mich...@databricks.com>
wrote:

> Here is how I view the relationship between the various components of
> Spark:
>
>  - *RDDs - *a low level API for expressing DAGs that will be executed in
> parallel by Spark workers
>  - *Catalyst -* an internal library for expressing trees that we use to
> build relational algebra and expression evaluation.  There's also an
> optimizer and query planner that turns these logical concepts into RDD
> actions.
>  - *Tungsten -* an internal optimized execution engine that can compile
> catalyst expressions into efficient java bytecode that operates directly on
> serialized binary data.  It also has nice low level data structures /
> algorithms like hash tables and sorting that operate directly on serialized
> data.  These are used by the physical nodes that are produced by the query
> planner (and run inside of RDD operation on workers).
>  - *DataFrames - *a user facing API that is similar to SQL/LINQ for
> constructing dataflows that are backed by catalyst logical plans
>  - *Datasets* - a user facing API that is similar to the RDD API for
> constructing dataflows that are backed by catalyst logical plans
>
> So everything is still operating on RDDs but I anticipate most users will
> eventually migrate to the higher level APIs for convenience and automatic
> optimization
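
One way to see the last point from the user side (a small, untested sketch, not tied to any particular dataset):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "label")

    df.filter($"id" > 1).explain(true)        // Catalyst logical and physical plans
    val asRdd = df.filter($"id" > 1).rdd      // the same query, exposed as an RDD[Row]
    println(asRdd.toDebugString)              // ...which is scheduled like any other RDD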
>
> On Mon, Nov 23, 2015 at 4:18 PM, Jakob Odersky <joder...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I'm doing some reading-up on all the newer features of Spark such as
>> DataFrames, DataSets and Project Tungsten. This got me a bit confused on
>> the relation between all these concepts.
>>
>> When starting to learn Spark, I read a book and the original paper on
>> RDDs, this led me to basically think "Spark == RDDs".
>> Now, looking into DataFrames, I read that they are basically
>> (distributed) collections with an associated schema, thus enabling
>> declarative queries and optimization (through Catalyst). I am uncertain how
>> DataFrames relate to RDDs: are DataFrames transformed to operations on RDDs
>> once they have been optimized? Or are they completely different concepts?
>> In case of the latter, do DataFrames still use the Spark scheduler and get
>> broken down into a DAG of stages and tasks?
>>
>> Regarding project Tungsten, where does it fit in? To my understanding it
>> is used to efficiently cache data in memory and may also be used to
>> generate query code for specialized hardware. This sounds as though it
>> would work on Spark's worker nodes, however it would also only work with
>> schema-associated data (aka DataFrames), thus leading me to the conclusion
>> that RDDs and DataFrames do not share a common backend which in turn
>> contradicts my conception of "Spark == RDDs".
>>
>> Maybe I missed the obvious as these questions seem pretty basic, however
>> I was unable to find clear answers in Spark documentation or related papers
>> and talks. I would greatly appreciate any clarifications.
>>
>> thanks,
>> --Jakob
>>
>
>


Re: Blocked REPL commands

2015-11-19 Thread Jakob Odersky
that definitely looks like a bug, go ahead with filing an issue
I'll check the scala repl source code to see what, if any, other commands
there are that should be disabled

On 19 November 2015 at 12:54, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> Dunno the answer, but :reset should be blocked, too, for obvious reasons.
>
> ➜  spark git:(master) ✗ ./bin/spark-shell
> ...
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.6.0-SNAPSHOT
>       /_/
>
> Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java
> 1.8.0_66)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> :reset
> Resetting interpreter state.
> Forgetting this session history:
>
>
>  @transient val sc = {
>val _sc = org.apache.spark.repl.Main.createSparkContext()
>println("Spark context available as sc.")
>_sc
>  }
>
>
>  @transient val sqlContext = {
>val _sqlContext = org.apache.spark.repl.Main.createSQLContext()
>println("SQL context available as sqlContext.")
>_sqlContext
>  }
>
> import org.apache.spark.SparkContext._
> import sqlContext.implicits._
> import sqlContext.sql
> import org.apache.spark.sql.functions._
> ...
>
> scala> import org.apache.spark._
> import org.apache.spark._
>
> scala> val sc = new SparkContext("local[*]", "shell", new SparkConf)
> ...
> org.apache.spark.SparkException: Only one SparkContext may be running
> in this JVM (see SPARK-2243). To ignore this error, set
> spark.driver.allowMultipleContexts = true. The currently running
> SparkContext was created at:
> org.apache.spark.SparkContext.<init>(SparkContext.scala:82)
> ...
>
> Guess I should file an issue?
>
> Pozdrawiam,
> Jacek
>
> --
> Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
> http://blog.jaceklaskowski.pl
> Mastering Apache Spark
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>
>
> On Thu, Nov 19, 2015 at 8:44 PM, Jakob Odersky <joder...@gmail.com> wrote:
> > I was just going through the spark shell code and saw this:
> >
> > private val blockedCommands = Set("implicits", "javap", "power",
> "type",
> > "kind")
> >
> > What is the reason as to why these commands are blocked?
> >
> > thanks,
> > --Jakob
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jakob Odersky
Hey Jeff,
Do you mean reading from multiple text files? In that case, as a
workaround, you can use the RDD#union() (or ++) method to concatenate
multiple rdds. For example:

val lines1 = sc.textFile("file1")
val lines2 = sc.textFile("file2")

val rdd = lines1 union lines2

regards,
--Jakob
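
The same approach generalizes to many files (a small, untested sketch with placeholder paths):

    val paths = Seq("hdfs:///data/part1", "hdfs:///data/part2", "hdfs:///data/part3")
    val all = sc.union(paths.map(sc.textFile(_)))

    // sc.textFile also accepts comma-separated paths and globs, since the path
    // string is handed to Hadoop's FileInputFormat:
    val viaGlob = sc.textFile("hdfs:///data/part1,hdfs:///logs/*.txt")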

On 11 November 2015 at 01:20, Jeff Zhang  wrote:

> Although user can use the hdfs glob syntax to support multiple inputs. But
> sometimes, it is not convenient to do that. Not sure why there's no api
> of SparkContext#textFiles. It should be easy to implement that. I'd love to
> create a ticket and contribute for that if there's no other consideration
> that I don't know.
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: Status of 2.11 support?

2015-11-11 Thread Jakob Odersky
Hi Sukant,

Regarding the first point: when building spark during my daily work, I
always use Scala 2.11 and have only run into build problems once. Assuming
a working build I have never had any issues with the resulting artifacts.

More generally however, I would advise you to go with Scala 2.11 under all
circumstances. Scala 2.10 has reached end-of-life and, from what I make out
of your question, you have the opportunity to switch to a newer technology,
so why stay with legacy? Furthermore, Scala 2.12 will be coming out early
next year, so I reckon that Spark will switch to Scala 2.11 by default
pretty soon*.

regards,
--Jakob

*I'm myself pretty new to the Spark community so please don't take my words
on it as gospel


On 11 November 2015 at 15:25, Ted Yu  wrote:

> For #1, the published jars are usable.
> However, you should build from source for your specific combination of
> profiles.
>
> Cheers
>
> On Wed, Nov 11, 2015 at 3:22 PM, shajra-cogscale <
> sha...@cognitivescale.com> wrote:
>
>> Hi,
>>
>> My company isn't using Spark in production yet, but we are using a bit of
>> Scala.  There's a few people who have wanted to be conservative and keep
>> our
>> Scala at 2.10 in the event we start using Spark.  There are others who
>> want
>> to move to 2.11 with the idea that by the time we're using Spark it will
>> be
>> more or less 2.11-ready.
>>
>> It's hard to make a strong judgement on these kinds of things without
>> getting some community feedback.
>>
>> Looking through the internet I saw:
>>
>> 1) There's advice to build 2.11 packages from source -- but also published
>> jars to Maven Central for 2.11.  Are these jars on Maven Central usable
>> and
>> the advice to build from source outdated?
>>
>> 2)  There's a note that the JDBC RDD isn't 2.11-compliant.  This is okay
>> for
>> us, but is there anything else to worry about?
>>
>> It would be nice to get some answers to those questions as well as any
>> other
>> feedback from maintainers or anyone that's used Spark with Scala 2.11
>> beyond
>> simple examples.
>>
>> Thanks,
>> Sukant
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-2-11-support-tp25362.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Spark Packages Configuration Not Found

2015-11-11 Thread Jakob Odersky
As another, general question, are spark packages the go-to way of extending
spark functionality? In my specific use-case I would like to start spark
(be it spark-shell or other) and hook into the listener API.
Since I wasn't able to find much documentation about spark packages, I was
wondering if they are still actively being developed?

thanks,
--Jakob
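
If the goal is only to hook into the listener API, that part does not strictly require a package at all. A minimal, untested sketch (class name and output are illustrative):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

    class MyListener extends SparkListener {
      override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
        println(s"stage ${event.stageInfo.stageId} completed, ${event.stageInfo.numTasks} tasks")
      }
    }

    sc.addSparkListener(new MyListener)   // register on a running context, e.g. in spark-shell

Alternatively, a listener class with a no-arg constructor can be registered at startup via --conf spark.extraListeners=my.package.MyListener.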

On 10 November 2015 at 14:58, Jakob Odersky <joder...@gmail.com> wrote:

> (accidental keyboard-shortcut sent the message)
> ... spark-shell from the spark 1.5.2 binary distribution.
> Also, running "spPublishLocal" has the same effect.
>
> thanks,
> --Jakob
>
> On 10 November 2015 at 14:55, Jakob Odersky <joder...@gmail.com> wrote:
>
>> Hi,
>> I ran into an error trying to run spark-shell with an external package
>> that I built and published locally
>> using the spark-package sbt plugin (
>> https://github.com/databricks/sbt-spark-package).
>>
>> To my understanding, spark packages can be published simply as maven
>> artifacts, yet after running "publishLocal" in my package project (
>> https://github.com/jodersky/spark-paperui), the following command
>>
>> spark-shell --packages
>> ch.jodersky:spark-paperui-server_2.10:0.1-SNAPSHOT
>>
>> gives an error:
>>
>> ::
>>
>> ::  UNRESOLVED DEPENDENCIES ::
>>
>> ::
>>
>> :: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
>> found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
>> required from org.apache.spark#spark-submit-parent;1.0 default
>>
>> ::
>>
>>
>> :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
>> Exception in thread "main" java.lang.RuntimeException: [unresolved
>> dependency: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
>> found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
>> required from org.apache.spark#spark-submit-parent;1.0 default]
>> at
>> org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1011)
>> at
>> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:12
>>
>> Do I need to include some default configuration? If so where and how
>> should I do it? All other packages I looked at had no such thing.
>>
>> Btw, I am using spark-shell from a
>>
>>
>


Re: Slow stage?

2015-11-11 Thread Jakob Odersky
Hi Simone,
I'm afraid I don't have an answer to your question. However I noticed the
DAG figures in the attachment. How did you generate these? I am myself
working on a project in which I am trying to generate visual
representations of the spark scheduler DAG. If such a tool already exists,
I would greatly appreciate any pointers.

thanks,
--Jakob

On 9 November 2015 at 13:52, Simone Franzini  wrote:

> Hi all,
>
> I have a complex Spark job that is broken up in many stages.
> I have a couple of stages that are particularly slow: each task takes
> around 6 - 7 minutes. This stage is fairly complex as you can see from the
> attached DAG. However, by construction each of the outer joins will have
> only 0 or 1 record on each side.
> It seems to me that this stage is really slow. However, the execution
> timeline shows that almost 100% of the time is spent in actual execution
> time not reading/writing to/from disk or in other overheads.
> Does this make any sense? I.e. is it just that these operations are slow
> (and notice task size in term of data seems small)?
> Is the pattern of operations in the DAG good or is it terribly suboptimal?
> If so, how could it be improved?
>
>
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: Spark Packages Configuration Not Found

2015-11-10 Thread Jakob Odersky
(accidental keyboard-shortcut sent the message)
... spark-shell from the spark 1.5.2 binary distribution.
Also, running "spPublishLocal" has the same effect.

thanks,
--Jakob

On 10 November 2015 at 14:55, Jakob Odersky <joder...@gmail.com> wrote:

> Hi,
> I ran into an error trying to run spark-shell with an external package
> that I built and published locally
> using the spark-package sbt plugin (
> https://github.com/databricks/sbt-spark-package).
>
> To my understanding, spark packages can be published simply as maven
> artifacts, yet after running "publishLocal" in my package project (
> https://github.com/jodersky/spark-paperui), the following command
>
> spark-shell --packages ch.jodersky:spark-paperui-server_2.10:0.1-SNAPSHOT
>
> gives an error:
>
> ::
>
> ::  UNRESOLVED DEPENDENCIES ::
>
> ::
>
> :: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
> found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
> required from org.apache.spark#spark-submit-parent;1.0 default
>
> ::
>
>
> :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
> Exception in thread "main" java.lang.RuntimeException: [unresolved
> dependency: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
> found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
> required from org.apache.spark#spark-submit-parent;1.0 default]
> at
> org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1011)
> at
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:12
>
> Do I need to include some default configuration? If so where and how
> should I do it? All other packages I looked at had no such thing.
>
> Btw, I am using spark-shell from a
>
>


Spark Packages Configuration Not Found

2015-11-10 Thread Jakob Odersky
Hi,
I ran into an error trying to run spark-shell with an external package that
I built and published locally
using the spark-package sbt plugin (
https://github.com/databricks/sbt-spark-package).

To my understanding, spark packages can be published simply as maven
artifacts, yet after running "publishLocal" in my package project (
https://github.com/jodersky/spark-paperui), the following command

    spark-shell --packages ch.jodersky:spark-paperui-server_2.10:0.1-SNAPSHOT

gives an error:

::

::  UNRESOLVED DEPENDENCIES ::

::

:: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
required from org.apache.spark#spark-submit-parent;1.0 default

::


:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved
dependency: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
required from org.apache.spark#spark-submit-parent;1.0 default]
at
org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1011)
at
org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:12

Do I need to include some default configuration? If so where and how should
I do it? All other packages I looked at had no such thing.

Btw, I am using spark-shell from a


Re: Turn off logs in spark-sql shell

2015-10-16 Thread Jakob Odersky
[repost to mailing list, ok I gotta really start hitting that
reply-to-all-button]

Hi,
Spark uses Log4j which unfortunately does not support fine-grained
configuration over the command line. Therefore some configuration file
editing will have to be done (unless you want to configure Loggers
programmatically, which however would require editing spark-sql).
Nevertheless, there seems to be a kind of "trick" where you can substitute
Java system properties in the log4j configuration file. See this
stackoverflow answer for details: http://stackoverflow.com/a/31208461/917519.
After editing the properties file, you can then start spark-sql with:

bin/spark-sql --conf
"spark.driver.extraJavaOptions=-Dmy.logger.threshold=OFF"

this is untested but I hope it helps,
--Jakob
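
For completeness, a sketch of the matching log4j.properties entry (my.logger.threshold is only an example name, mirroring the command above; log4j 1.x substitutes ${...} from JVM system properties):

    # conf/log4j.properties
    log4j.rootCategory=${my.logger.threshold}, console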

On 15 October 2015 at 22:56, Muhammad Ahsan 
wrote:

> Hello Everyone!
>
> I want to know how to turn off logging during starting *spark-sql shell*
> without changing log4j configuration files. In normal spark-shell I can use
> the following commands
>
> import org.apache.log4j.Logger
> import org.apache.log4j.Level
>
> Logger.getLogger("org").setLevel(Level.OFF)
> Logger.getLogger("akka").setLevel(Level.OFF)
>
>
> Thanks
>
> --
> Thanks
>
> Muhammad Ahsan
>
>


Re: Building with SBT and Scala 2.11

2015-10-14 Thread Jakob Odersky
[Repost to mailing list]

Hey,
Sorry about the typo, I of course meant hadoop-2.6, not 2.11.
I suspect something bad happened with my Ivy cache, since when reverting
back to scala 2.10, I got a very strange IllegalStateException (something
something IvyNode, I can't remember the details).
Killing the cache made 2.10 work at least; I'll retry with 2.11

Thx for your help
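
For reference, the corrected invocation with the intended profile (flags otherwise unchanged from the original report):

    dev/change-version-to-2.11.sh
    build/sbt -Pyarn -Phadoop-2.6 -Dscala-2.11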
On Oct 14, 2015 6:52 AM, "Ted Yu" <yuzhih...@gmail.com> wrote:

> Adrian:
> Likely you were using maven.
>
> Jakob's report was with sbt.
>
> Cheers
>
> On Tue, Oct 13, 2015 at 10:05 PM, Adrian Tanase <atan...@adobe.com> wrote:
>
>> Do you mean hadoop-2.4 or 2.6? not sure if this is the issue but I'm also
>> compiling the 1.5.1 version with scala 2.11 and hadoop 2.6 and it works.
>>
>> -adrian
>>
>> Sent from my iPhone
>>
>> On 14 Oct 2015, at 03:53, Jakob Odersky <joder...@gmail.com> wrote:
>>
>> I'm having trouble compiling Spark with SBT for Scala 2.11. The command I
>> use is:
>>
>> dev/change-version-to-2.11.sh
>> build/sbt -Pyarn -Phadoop-2.11 -Dscala-2.11
>>
>> followed by
>>
>> compile
>>
>> in the sbt shell.
>>
>> The error I get specifically is:
>>
>> spark/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
>> no valid targets for annotation on value conf - it is discarded unused. You
>> may specify targets with meta-annotations, e.g. @(transient @param)
>> [error] private[netty] class NettyRpcEndpointRef(@transient conf:
>> SparkConf)
>> [error]
>>
>> However I am also getting a large amount of deprecation warnings, making
>> me wonder if I am supplying some incompatible/unsupported options to sbt. I
>> am using Java 1.8 and the latest Spark master sources.
>> Does someone know if I am doing anything wrong or is the sbt build broken?
>>
>> thanks for your help,
>> --Jakob
>>
>>
>


Building with SBT and Scala 2.11

2015-10-13 Thread Jakob Odersky
I'm having trouble compiling Spark with SBT for Scala 2.11. The command I
use is:

dev/change-version-to-2.11.sh
build/sbt -Pyarn -Phadoop-2.11 -Dscala-2.11

followed by

compile

in the sbt shell.

The error I get specifically is:

spark/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
no valid targets for annotation on value conf - it is discarded unused. You
may specify targets with meta-annotations, e.g. @(transient @param)
[error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf)
[error]

However I am also getting a large amount of deprecation warnings, making me
wonder if I am supplying some incompatible/unsupported options to sbt. I am
using Java 1.8 and the latest Spark master sources.
Does someone know if I am doing anything wrong or is the sbt build broken?

thanks for your help,
--Jakob