Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jakob Odersky
Hey Jeff, Do you mean reading from multiple text files? In that case, as a workaround, you can use the RDD#union() (or ++) method to concatenate multiple rdds. For example: val lines1 = sc.textFile("file1") val lines2 = sc.textFile("file2") val rdd = lines1 union lines2 regards, --Jakob On 11
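
For illustration, a minimal sketch of combining more than two files (paths are placeholders); SparkContext#union takes a whole sequence of RDDs, and since textFile delegates to Hadoop's FileInputFormat, a comma-separated list of paths should also work:

    val paths = Seq("file1", "file2", "file3")
    val rdds = paths.map(p => sc.textFile(p))
    val combined = sc.union(rdds)                    // same as rdds.reduce(_ union _)
    val combined2 = sc.textFile(paths.mkString(",")) // comma-separated list of inputs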

Re: Spark Packages Configuration Not Found

2015-11-10 Thread Jakob Odersky
(accidental keyboard-shortcut sent the message) ... spark-shell from the spark 1.5.2 binary distribution. Also, running "spPublishLocal" has the same effect. thanks, --Jakob On 10 November 2015 at 14:55, Jakob Odersky <joder...@gmail.com> wrote: > Hi, > I ran into in err

Spark Packages Configuration Not Found

2015-11-10 Thread Jakob Odersky
Hi, I ran into an error trying to run spark-shell with an external package that I built and published locally using the spark-package sbt plugin (https://github.com/databricks/sbt-spark-package). To my understanding, spark packages can be published simply as maven artifacts, yet after running

Re: Status of 2.11 support?

2015-11-11 Thread Jakob Odersky
Hi Sukant, Regarding the first point: when building spark during my daily work, I always use Scala 2.11 and have only run into build problems once. Assuming a working build I have never had any issues with the resulting artifacts. More generally however, I would advise you to go with Scala 2.11

Re: Spark Packages Configuration Not Found

2015-11-11 Thread Jakob Odersky
if they are still actively being developed? thanks, --Jakob On 10 November 2015 at 14:58, Jakob Odersky <joder...@gmail.com> wrote: > (accidental keyboard-shortcut sent the message) > ... spark-shell from the spark 1.5.2 binary distribution. > Also, running "spPublishLocal"

Re: Slow stage?

2015-11-11 Thread Jakob Odersky
Hi Simone, I'm afraid I don't have an answer to your question. However I noticed the DAG figures in the attachment. How did you generate these? I am myself working on a project in which I am trying to generate visual representations of the spark scheduler DAG. If such a tool already exists, I

Re: Turn off logs in spark-sql shell

2015-10-16 Thread Jakob Odersky
[repost to mailing list, ok I gotta really start hitting that reply-to-all-button] Hi, Spark uses Log4j which unfortunately does not support fine-grained configuration over the command line. Therefore some configuration file editing will have to be done (unless you want to configure Loggers
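
For reference, a minimal conf/log4j.properties sketch (based on the template shipped with Spark) that raises the root level to WARN:

    # copy conf/log4j.properties.template to conf/log4j.properties, then set:
    log4j.rootCategory=WARN, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n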

Re: Building with SBT and Scala 2.11

2015-10-14 Thread Jakob Odersky
:05 PM, Adrian Tanase <atan...@adobe.com> wrote: > >> Do you mean hadoop-2.4 or 2.6? not sure if this is the issue but I'm also >> compiling the 1.5.1 version with scala 2.11 and hadoop 2.6 and it works. >> >> -adrian >> >> Sent from my iPhone >> >

Building with SBT and Scala 2.11

2015-10-13 Thread Jakob Odersky
I'm having trouble compiling Spark with SBT for Scala 2.11. The command I use is: dev/change-version-to-2.11.sh build/sbt -Pyarn -Phadoop-2.11 -Dscala-2.11 followed by compile in the sbt shell. The error I get specifically is:

Re: Help with type check

2015-11-30 Thread Jakob Odersky
Hi Eyal, what you're seeing is not a Spark issue, it is related to boxed types. I assume 'b' in your code is some kind of java buffer, where b.getDouble() returns an instance of java.lang.Double and not a scala.Double. Hence muCouch is an Array[java.lang.Double], an array containing boxed
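
A minimal sketch of the distinction (the buffer from the original thread is not shown); mapping over the boxed array unboxes the values explicitly:

    val boxed: Array[java.lang.Double] = Array(java.lang.Double.valueOf(1.0), java.lang.Double.valueOf(2.0))
    val unboxed: Array[Double] = boxed.map(_.doubleValue) // now an array of scala.Double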

Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-10 Thread Jakob Odersky
Could you provide some more context? What is rawData? On 10 December 2015 at 06:38, Bonsen wrote: > I do like this "val secondData = rawData.flatMap(_.split("\t").take(3))" > > and I find: > 15/12/10 22:36:55 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, >

Re: File not found error running query in spark-shell

2015-12-16 Thread Jakob Odersky
When you re-run the last statement a second time, does it work? Could it be related to https://issues.apache.org/jira/browse/SPARK-12350 ? On 16 December 2015 at 10:39, Ted Yu wrote: > Hi, > I used the following command on a recently refreshed checkout of master > branch: >

Re: File not found error running query in spark-shell

2015-12-16 Thread Jakob Odersky
he amount of exceptions preceding > the result that surprised me. > > I want to see if there is a way of getting rid of the exceptions. > > Thanks > > On Wed, Dec 16, 2015 at 10:53 AM, Jakob Odersky <joder...@gmail.com> > wrote: > >> When you re-run the last statement a s

Re: File not found error running query in spark-shell

2015-12-16 Thread Jakob Odersky
For future reference, this should be fixed with PR #10337 ( https://github.com/apache/spark/pull/10337) On 16 December 2015 at 11:01, Jakob Odersky <joder...@gmail.com> wrote: > Yeah, the same kind of error actually happens in the JIRA. It actually > succeeds but a load of exception

Re: ideal number of executors per machine

2015-12-15 Thread Jakob Odersky
Hi Veljko, I would assume keeping the number of executors per machine to a minimum is best for performance (as long as you consider memory requirements as well). Each executor is a process that can run tasks in multiple threads. On a kernel/hardware level, thread switches are much cheaper than
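
For context, executor count and size are usually set at submission time; a sketch with placeholder values (YARN-style flags), favoring a few large executors per node over many small ones:

    spark-submit --num-executors 4 --executor-cores 8 --executor-memory 16g my-app.jar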

Re: what are the cons/drawbacks of a Spark DataFrames

2015-12-15 Thread Jakob Odersky
With DataFrames you lose type-safety. Depending on the language you are using this can also be considered a drawback. On 15 December 2015 at 15:08, Jakob Odersky <joder...@gmail.com> wrote: > By using DataFrames you will not need to specify RDD operations explicity, > instead th

Re: what are the cons/drawbacks of a Spark DataFrames

2015-12-15 Thread Jakob Odersky
By using DataFrames you will not need to specify RDD operations explicitly; instead, the operations are built and optimized using the information available in the DataFrame's schema. The only drawback I can think of is some loss of generality: given a dataframe containing types A, you will
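
A small sketch of the trade-off (names are hypothetical): a typo in an RDD field access is caught by the compiler, while the equivalent DataFrame expression only fails at runtime during analysis:

    case class Person(name: String, age: Int)
    val rdd = sc.parallelize(Seq(Person("Ann", 30)))
    // rdd.map(_.agee)        // does not compile: the typo is caught statically
    val df = sqlContext.createDataFrame(rdd)
    df.select("agee")         // compiles, but throws an AnalysisException at runtime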

Re: Warning: Master endpoint spark://ip:7077 was not a REST server. Falling back to legacy submission gateway instead.

2015-12-10 Thread Jakob Odersky
Is there any other process using port 7077? On 10 December 2015 at 08:52, Andy Davidson wrote: > Hi > > I am using spark-1.5.1-bin-hadoop2.6. Any idea why I get this warning. My > job seems to run with out any problem. > > Kind regards > > Andy > > +

Re: StackOverflowError when writing dataframe to table

2015-12-10 Thread Jakob Odersky
Can you give us some more info about the dataframe and caching? Ideally a set of steps to reproduce the issue On 9 December 2015 at 14:59, apu mishra . rr wrote: > The command > > mydataframe.write.saveAsTable(name="tablename") > > sometimes results in

Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-14 Thread Jakob Odersky
> Sorry,I'm late.I try again and again ,now I use spark 1.4.0 ,hadoop 2.4.1.but I also find something strange like this : > http://apache-spark-user-list.1001560.n3.nabble.com/worker-java-lang-ClassNotFoundException-ttt-test-anonfun-1-td25696.html > (if i use "textFile",It can't run.) In the

Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-14 Thread Jakob Odersky
sorry typo, I meant *without* the addJar On 14 December 2015 at 11:13, Jakob Odersky <joder...@gmail.com> wrote: > > Sorry,I'm late.I try again and again ,now I use spark 1.4.0 ,hadoop > 2.4.1.but I also find something strange like this : > > > > http://apache

Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-11 Thread Jakob Odersky
It looks like you have an issue with your classpath, I think it is because you add a jar containing Spark twice: first, you have a dependency on Spark somewhere in your build tool (this allows you to compile and run your application), second you re-add Spark here >
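
A build.sbt sketch of the usual remedy (version is a placeholder): mark Spark as "provided" so it is available at compile time but not bundled or re-added at runtime:

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0" % "provided"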

Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-11 Thread Jakob Odersky
Btw, Spark 1.5 comes with support for hadoop 2.2 by default On 11 December 2015 at 03:08, Bonsen wrote: > Thank you,and I find the problem is my package is test,but I write package > org.apache.spark.examples ,and IDEA had imported the > spark-examples-1.5.2-hadoop2.6.0.jar

Re: Why is this job running since one hour?

2016-01-06 Thread Jakob Odersky
What is the job doing? How much data are you processing? On 6 January 2016 at 10:33, unk1102 wrote: > Hi I have one main Spark job which spawns multiple child spark jobs. One of > the child spark job is running for an hour and it keeps on hanging there I > have taken snap

Re: What should be the ideal value(unit) for spark.memory.offheap.size

2016-01-06 Thread Jakob Odersky
Check the configuration guide for a description on units ( http://spark.apache.org/docs/latest/configuration.html#spark-properties). In your case, 5GB would be specified as 5g. On 6 January 2016 at 10:29, unk1102 wrote: > Hi As part of Spark 1.6 release what should be
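
For example, with placeholder values (the camel-cased keys as listed on the configuration page; using off-heap memory also requires the enabling flag):

    spark-submit --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=5g ...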

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2015-12-17 Thread Jakob Odersky
It might be a good idea to see how many files are open and try increasing the open file limit (this is done on an os level). In some application use-cases it is actually a legitimate need. If that doesn't help, make sure you close any unused files and streams in your code. It will also be easier

Re: Blocked REPL commands

2015-11-19 Thread Jakob Odersky
> Jacek Laskowski | https://medium.com/@jaceklaskowski/ | > http://blog.jaceklaskowski.pl > Mastering Apache Spark > https://jaceklaskowski.gitbooks.io/mastering-apache-spark/ > Follow me at https://twitter.com/jaceklaskowski > Upvote at http://stackoverflow.com/users/1305344/jace

Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Jakob Odersky
Hi everyone, I'm doing some reading-up on all the newer features of Spark such as DataFrames, DataSets and Project Tungsten. This got me a bit confused on the relation between all these concepts. When starting to learn Spark, I read a book and the original paper on RDDs; this led me to

Re: Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Jakob Odersky
facing API that is similar to the RDD API for > constructing dataflows that are backed by catalyst logical plans > > So everything is still operating on RDDs but I anticipate most users will > eventually migrate to the higher level APIs for convenience and automatic > optimization

Re: simultaneous actions

2016-01-15 Thread Jakob Odersky
I don't think RDDs are threadsafe. More fundamentally however, why would you want to run RDD actions in parallel? The idea behind RDDs is to provide you with an abstraction for computing parallel operations on distributed data. Even if you were to call actions from several threads at once, the
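
For reference, a minimal sketch of the pattern under discussion, i.e. launching two actions from separate threads via Scala futures (rddA and rddB are hypothetical); whether this helps at all depends on free cluster capacity and the scheduler:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    val counts = Future.sequence(Seq(Future(rddA.count()), Future(rddB.count())))
    Await.result(counts, 10.minutes)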

Re: simultaneous actions

2016-01-15 Thread Jakob Odersky
e JDBC server. >> >> Matei >> >> On Jan 15, 2016, at 2:10 PM, Jakob Odersky <joder...@gmail.com> wrote: >> >> I don't think RDDs are threadsafe. >> More fundamentally however, why would you want to run RDD actions in >> parallel? The idea behin

Re: trouble using eclipse to view spark source code

2016-01-18 Thread Jakob Odersky
Have you followed the guide on how to import spark into eclipse https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse ? On 18 January 2016 at 13:04, Andy Davidson wrote: > Hi > > My project is implemented using Java

Re: How to parallel read files in a directory

2016-02-11 Thread Jakob Odersky
Hi Junjie, How do you access the files currently? Have you considered using hdfs? It's designed to be distributed across a cluster and Spark has built-in support. Best, --Jakob On Feb 11, 2016 9:33 AM, "Junjie Qian" wrote: > Hi all, > > I am working with Spark 1.6,

Re: How to debug ClassCastException: java.lang.String cannot be cast to java.lang.Long in SparkSQL

2016-01-27 Thread Jakob Odersky
> the data type mapping has been taken care of in my code, could you share this? On Tue, Jan 26, 2016 at 8:30 PM, Anfernee Xu wrote: > Hi, > > I'm using Spark 1.5.0, I wrote a custom Hadoop InputFormat to load data from > 3rdparty datasource, the data type mapping has been

Re: Escaping tabs and newlines not working

2016-01-27 Thread Jakob Odersky
Can you provide some code that reproduces the issue, specifically in a spark job? The linked stackoverflow question is related to plain scala and the proposed answers offer a solution. On Wed, Jan 27, 2016 at 1:57 PM, Harshvardhan Chauhan wrote: > > > Hi, > > Escaping newline

Re: Maintain state outside rdd

2016-01-27 Thread Jakob Odersky
Be careful with mapPartitions though: since it is executed on worker nodes, you may not see side-effects locally. Is it not possible to represent your state changes as part of your rdd's transformations? I.e. return a tuple containing the modified data and some accumulated state. If that really
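
A minimal sketch of that idea (types and names are hypothetical): keep the state local to the partition and return it alongside the data instead of mutating something outside the RDD:

    val withState = rdd.mapPartitions { it =>
      var processed = 0L                // per-partition state, lives on the worker
      it.map { record =>
        processed += 1
        (record, processed)             // return the data together with the state
      }
    }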

Re: Python UDFs

2016-01-27 Thread Jakob Odersky
Have you checked: - the mllib doc for python https://spark.apache.org/docs/1.6.0/api/python/pyspark.mllib.html#pyspark.mllib.linalg.DenseVector - the udf doc https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.functions.udf You should be fine in returning a DenseVector

Re: Using Spark in mixed Java/Scala project

2016-01-27 Thread Jakob Odersky
JavaSparkContext has a wrapper constructor for the "scala" SparkContext. In this case all you need to do is declare a SparkContext that is accessible both from the Java and Scala sides of your project and wrap the context with a JavaSparkContext. Search for java source compatibility with scala for
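
A minimal sketch of the wrapper (the app name is a placeholder): create the context once on the Scala side and hand the wrapped version to the Java code:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.api.java.JavaSparkContext

    val sc  = new SparkContext(new SparkConf().setAppName("mixed-project"))
    val jsc = new JavaSparkContext(sc)  // pass this to the Java parts of the project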

Re: Spark 1.5.2 memory error

2016-02-02 Thread Jakob Odersky
Can you share some code that produces the error? It is probably not due to spark but rather the way data is handled in the user code. Does your code call any reduceByKey actions? These are often a source for OOM errors. On Tue, Feb 2, 2016 at 1:22 PM, Stefan Panayotov wrote:

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jakob Odersky
To address one specific question: > Docs says it usues sun.misc.unsafe to convert physical rdd structure into byte array at some point for optimized GC and memory. My question is why is it only applicable to SQL/Dataframe and not RDD? RDD has types too! A principal difference between RDDs and

Re: Spark 2.0.0 release plan

2016-01-29 Thread Jakob Odersky
I'm not an authoritative source but I think it is indeed the plan to move the default build to 2.11. See this discussion for more detail http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html On Fri, Jan 29, 2016 at 11:43 AM, Deenar Toraskar

Re: Option[Long] parameter in case class parsed from JSON DataFrame failing when key not present in JSON

2016-02-22 Thread Jakob Odersky
I think the issue is that the `json.read` function has no idea of the underlying schema, in fact the documentation (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader) says: > Unless the schema is specified using schema function, this function goes >
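
A sketch of supplying the schema up front (field names and path are placeholders), so that a key missing from the JSON simply becomes a null/None instead of breaking inference:

    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("id", LongType, nullable = true),    // may be absent in the JSON
      StructField("name", StringType, nullable = true)))

    val df = sqlContext.read.schema(schema).json("events.json")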

Re: SparkMaster IP

2016-02-22 Thread Jakob Odersky
Spark master by default binds to whatever ip address your current host resolves to. You have a few options to change that: - override the ip by setting the environment variable SPARK_LOCAL_IP - change the ip in your local "hosts" file (/etc/hosts on linux, not sure on windows) - specify a

Re: How to delete a record from parquet files using dataframes

2016-02-24 Thread Jakob Odersky
You can `filter` your dataframes before saving them to, or after reading them from, parquet files On Wed, Feb 24, 2016 at 1:28 AM, Cheng Lian
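
A minimal sketch (column, value and paths are placeholders); since parquet files are not updated in place, "deleting" a record amounts to writing out a filtered copy:

    val df = sqlContext.read.parquet("/data/events")
    val cleaned = df.filter("id != 42")          // keep everything except the records to delete
    cleaned.write.parquet("/data/events_cleaned")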

Re: How could I do this algorithm in Spark?

2016-02-24 Thread Jakob Odersky
Hi Guillermo, assuming that the first "a,b" is a typo and you actually meant "a,d", this is a sorting problem. You could easily model your data as an RDD of tuples (or as a dataframe/set) and use the sortBy (or orderBy for dataframe/sets) methods. best, --Jakob On Wed, Feb 24, 2016 at 2:26 PM,
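
A minimal sketch of both variants with made-up data:

    val rdd = sc.parallelize(Seq(("a", "d"), ("c", "b"), ("b", "a")))
    val sortedRdd = rdd.sortBy(_._1)                             // RDD variant

    val df = sqlContext.createDataFrame(rdd).toDF("src", "dst")
    val sortedDf = df.orderBy("src")                             // DataFrame variant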

Re: retrieving all the rows with collect()

2016-02-10 Thread Jakob Odersky
Hi Mich, your assumptions 1 to 3 are all correct (nitpick: they're method *calls*, the methods being the part before the parentheses, but I assume that's what you meant). The last one is also a method call but uses syntactic sugar on top: `foreach(println)` boils down to `foreach(line =>

Re: retrieving all the rows with collect()

2016-02-10 Thread Jakob Odersky
t; > println(line)) > > Regards, > > Mich > > On 10/02/2016 23:21, Jakob Odersky wrote: > > Hi Mich, > your assumptions 1 to 3 are all correct (nitpick: they're method > *calls*, the methods being the part before the parentheses, but I > assume that's what you

Re: How to collect/take arbitrary number of records in the driver?

2016-02-10 Thread Jakob Odersky
Another alternative: rdd.take(1000).drop(100) //this also preserves ordering Note however that this can lead to an OOM if the data you're taking is too large. If you want to perform some operation sequentially on your driver and don't care about performance, you could do something similar as
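
One possible way to do such sequential driver-side processing without collecting everything at once (not necessarily what the truncated message was going to suggest) is RDD#toLocalIterator, which fetches one partition at a time:

    rdd.toLocalIterator.slice(100, 1000).foreach(println)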

Re: Error building spark app with Maven

2016-03-15 Thread Jakob Odersky
Hi Mich, probably unrelated to the current error you're seeing, however the following dependencies will bite you later: spark-hive_2.10 spark-csv_2.11 the problem here is that you're using libraries built for different Scala binary versions (the numbers after the underscore). The simple fix here
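
A build.sbt sketch of the sbt-side fix (versions are placeholders): the %% operator appends the project's Scala binary version automatically, so the suffixes cannot diverge; in Maven the suffix has to be kept consistent by hand:

    scalaVersion := "2.10.5"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-hive" % "1.5.1" % "provided",
      "com.databricks"   %% "spark-csv"  % "1.3.0"
    )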

Re: Error building spark app with Maven

2016-03-15 Thread Jakob Odersky
k >>>> spark-sql_2.10 >>>> 1.5.1 >>>> >>>> >>>> >>>> [DEBUG] endProcessChildren: artifact=spark:scala:jar:1.0 >>>> [INFO] >>>> >&

Re: Installing Spark on Mac

2016-03-11 Thread Jakob Odersky
ommand env|grep SPARK; nothing comes back >>>> >>>> Tried env|grep Spark; which is the directory I created for Spark once I >>>> downloaded the tgz file; comes back with PWD=/Users/aidatefera/Spark >>>> >>>> Tried running ./bin/spark-shell ; come

Re: Installing Spark on Mac

2016-03-11 Thread Jakob Odersky
regarding my previous message, I forgot to mention to run netstat as root (sudo netstat -plunt) sorry for the noise On Fri, Mar 11, 2016 at 12:29 AM, Jakob Odersky <ja...@odersky.com> wrote: > Some more diagnostics/suggestions: > > 1) are other services listening to ports in the

Re: How to distribute dependent files (.so , jar ) across spark worker nodes

2016-03-14 Thread Jakob Odersky
Have you tried setting the configuration `spark.executor.extraLibraryPath` to point to a location where your .so's are available? (Not sure if non-local files, such as HDFS, are supported) On Mon, Mar 14, 2016 at 2:12 PM, Tristan Nixon wrote: > What build system are you
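
Two hedged variants, with placeholder paths:

    # the .so files already exist at the same path on every node
    spark-submit --conf spark.executor.extraLibraryPath=/opt/native/lib ...

    # or (untested sketch, YARN): ship them with the job and point the library path
    # at the executors' working directory, where --files places them
    spark-submit --files /local/path/libfoo.so --conf spark.executor.extraLibraryPath=. ...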

Re: Installing Spark on Mac

2016-03-08 Thread Jakob Odersky
I've had some issues myself with the user-provided-Hadoop version. If you simply just want to get started, I would recommend downloading Spark (pre-built, with any of the hadoop versions) as Cody suggested. A simple step-by-step guide: 1. curl

Re: Installing Spark on Mac

2016-03-09 Thread Jakob Odersky
Sorry had a typo in my previous message: > try running just "/bin/spark-shell" please remove the leading slash (/) On Wed, Mar 9, 2016 at 1:39 PM, Aida Tefera wrote: > Hi there, tried echo $SPARK_HOME but nothing comes back so I guess I need to > set it. How would I do

Re: Installing Spark on Mac

2016-03-09 Thread Jakob Odersky
As Tristan mentioned, it looks as though Spark is trying to bind on port 0 and then 1 (which is not allowed). Could it be that some environment variables from you previous installation attempts are polluting your configuration? What does running "env | grep SPARK" show you? Also, try running just

Re: installing packages with pyspark

2016-03-19 Thread Jakob Odersky
line of spark-submit or pyspark. See >> http://spark.apache.org/docs/latest/submitting-applications.html >> >> _ >> From: Jakob Odersky <ja...@odersky.com> >> Sent: Thursday, March 17, 2016 6:40 PM >> Subject: Re: installing pa

Re: I want to unsubscribe

2016-04-05 Thread Jakob Odersky
to unsubscribe, send an email to user-unsubscr...@spark.apache.org On Tue, Apr 5, 2016 at 4:50 PM, Ranjana Rajendran wrote: > I get to see the threads in the public mailing list. I don;t want so many > messages in my inbox. I want to unsubscribe.

Re: The error to read HDFS custom file in spark.

2016-03-19 Thread Jakob Odersky
Doesn't FileInputFormat require type parameters? Like so: class RawDataInputFormat[LW <: LongWritable, RD <: RDRawDataRecord] extends FileInputFormat[LW, RD] I haven't verified this but it could be related to the compile error you're getting. On Thu, Mar 17, 2016 at 9:53 AM, Benyi Wang

Re: installing packages with pyspark

2016-03-19 Thread Jakob Odersky
Hi, regarding 1, packages are resolved locally. That means that when you specify a package, spark-submit will resolve the dependencies and download any jars on the local machine, before shipping* them to the cluster. So, without a priori knowledge of dataproc clusters, it should be no different to
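
For example (coordinates and script name are placeholders); the dependency graph is resolved and downloaded on the submitting machine, then the jars are shipped with the job:

    spark-submit --packages com.databricks:spark-csv_2.10:1.3.0 my_job.py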

Re: ClassNotFoundException in RDD.map

2016-03-20 Thread Jakob Odersky
The error is very strange indeed, however without code that reproduces it, we can't really provide much help beyond speculation. One thing that stood out to me immediately is that you say you have an RDD of Any where every Any should be a BigDecimal, so why not specify that type information? When

Re: Can't zip RDDs with unequal numbers of partitions

2016-03-20 Thread Jakob Odersky
Can you share a snippet that reproduces the error? What was spark.sql.autoBroadcastJoinThreshold before your last change? On Thu, Mar 17, 2016 at 10:03 AM, Jiří Syrový wrote: > Hi, > > any idea what could be causing this issue? It started appearing after > changing

Re: Building spark submodule source code

2016-03-21 Thread Jakob Odersky
Another gotcha to watch out for are the SPARK_* environment variables. Have you exported SPARK_HOME? In that case, 'spark-shell' will use Spark from the variable, regardless of the place the script is called from. I.e. if SPARK_HOME points to a release version of Spark, your code changes will

Re: why spark 1.6 use Netty instead of Akka?

2016-05-23 Thread Jakob Odersky
Spark actually used to depend on Akka. Unfortunately this brought in all of Akka's dependencies (in addition to Spark's already quite complex dependency graph) and, as Todd mentioned, led to conflicts with projects using both Spark and Akka. It would probably be possible to use Akka and shade it

Re: I'm trying to understand how to compile Spark

2016-07-19 Thread Jakob Odersky
Hi Eli, to build spark, just run build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests package in your source directory, where package is the actual word "package". This will recompile the whole project, so it may take a while when running the first time. Replacing a single file

Re: Error in Word Count Program

2016-07-19 Thread Jakob Odersky
Does the file /home/user/spark-1.5.1-bin-hadoop2.4/bin/README.md exist? On Tue, Jul 19, 2016 at 4:30 AM, RK Spark wrote: > val textFile = sc.textFile("README.md")val linesWithSpark = > textFile.filter(line => line.contains("Spark")) >

Re: Dataset encoder for java.time.LocalDate?

2016-09-02 Thread Jakob Odersky
Spark currently requires at least Java 1.7, so adding a Java 1.8-specific encoder will not be straightforward without affecting requirements. I can think of two solutions: 1. add a Java 1.8 build profile which includes such encoders (this may be useful for Scala 2.12 support in the future as

Re: Scala Vs Python

2016-09-01 Thread Jakob Odersky
> However, what really worries me is not having Dataset APIs at all in Python. I think thats a deal breaker. What is the functionality you are missing? In Spark 2.0 a DataFrame is just an alias for Dataset[Row] ("type DataFrame = Dataset[Row]" in core/.../o/a/s/sql/package.scala). Since python is

Re: Possible Code Generation Bug: Can Spark 2.0 Datasets handle Scala Value Classes?

2016-09-01 Thread Jakob Odersky
Hi Aris, thanks for sharing this issue. I can confirm that value classes currently don't work, however I can't think of reason why they shouldn't be supported. I would therefore recommend that you report this as a bug. (Btw, value classes also currently aren't definable in the REPL. See

Re: Possible Code Generation Bug: Can Spark 2.0 Datasets handle Scala Value Classes?

2016-09-01 Thread Jakob Odersky
I'm not sure how the shepherd thing works, but just FYI Michael Armbrust originally wrote Catalyst, the engine behind Datasets. You can find a list of all committers here https://cwiki.apache.org/confluence/display/SPARK/Committers. Another good resource is to check https://spark-prs.appspot.com/

Re: Scala Vs Python

2016-09-01 Thread Jakob Odersky
w>* >>> >>> >>> >>> http://talebzadehmich.wordpress.com >>> >>> >>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>> any loss, damage or destruction of data or any other property which may >>>

Re: Scala Vs Python

2016-09-02 Thread Jakob Odersky
As you point out, often the reason that Python support lags behind is that functionality is implemented in Scala, so the API in that language is "free" whereas Python support needs to be added explicitly. Nevertheless, Python bindings are an important part of Spark and are used by many people (this

Re: Scala Vs Python

2016-09-02 Thread Jakob Odersky
Forgot to answer your question about feature parity of Python w.r.t. Spark's different components I mostly work with scala so I can't say for sure but I think that all pre-2.0 features (that's basically everything except Structured Streaming) are on par. Structured Streaming is a pretty new

Re: How to use custom class in DataSet

2016-08-30 Thread Jakob Odersky
Implementing custom encoders is unfortunately not well supported at the moment (IIRC there are plans to eventually add an api for user defined encoders). That being said, there are a couple of encoders that can work with generic, serializable data types: "javaSerialization" and "kryo", found here
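
A minimal sketch using the kryo encoder for a class that has no built-in encoder (the class is hypothetical):

    import org.apache.spark.sql.{Encoder, Encoders}

    class MyRecord(val id: Long, val payload: String) extends Serializable

    implicit val myRecordEncoder: Encoder[MyRecord] = Encoders.kryo[MyRecord]
    val ds = spark.createDataset(Seq(new MyRecord(1L, "a")))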

Re: Returning DataFrame as Scala method return type

2016-09-08 Thread Jakob Odersky
(Maybe unrelated FYI): in case you're using only Scala or Java with Spark, I would recommend to use Datasets instead of DataFrames. They provide exactly the same functionality, yet offer more type-safety. On Thu, Sep 8, 2016 at 11:05 AM, Lee Becker wrote: > > On Thu, Sep
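
A minimal sketch of the conversion (names and path are placeholders): a DataFrame is just Dataset[Row], and as[...] gives it a typed view so field access is checked at compile time:

    case class Account(id: Long, balance: Double)
    import spark.implicits._

    val df = spark.read.parquet("/data/accounts")   // DataFrame = Dataset[Row]
    val ds = df.as[Account]                         // Dataset[Account]
    ds.map(_.balance).filter(_ > 0)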

Re: iterating over DataFrame Partitions sequentially

2016-09-09 Thread Jakob Odersky
Hi Sujeet, going sequentially over all parallel, distributed data seems like a counter-productive thing to do. What are you trying to accomplish? regards, --Jakob On Fri, Sep 9, 2016 at 3:29 AM, sujeet jog wrote: > Hi, > Is there a way to iterate over a DataFrame with n

Re: iterating over DataFrame Partitions sequentially

2016-09-09 Thread Jakob Odersky
st use-case of Spark though and will probably be a performance bottleneck. On Fri, Sep 9, 2016 at 11:45 AM, Jakob Odersky <ja...@odersky.com> wrote: > Hi Sujeet, > > going sequentially over all parallel, distributed data seems like a > counter-productive thing to do. What are you

Re: Can I assign affinity for spark executor processes?

2016-09-13 Thread Jakob Odersky
Hi Xiaoye, could it be that the executors were spawned before the affinity was set on the worker? Would it help to start spark worker with taskset from the beginning, i.e. "taskset [mask] start-slave.sh"? Workers in spark (standalone mode) simply create processes with the standard java process

Re: Package org.apache.spark.annotation no longer exist in Spark 2.0?

2016-10-04 Thread Jakob Odersky
It's still there on master. It is in the "spark-tags" module however (under common/tags), maybe something changed in the build environment and it isn't made available as a dependency to your project? What happens if you include the module as a direct dependency? --Jakob On Tue, Oct 4, 2016 at

Re: Apache Spark JavaRDD pipe() need help

2016-09-21 Thread Jakob Odersky
Can you provide more details? It's unclear what you're asking On Wed, Sep 21, 2016 at 10:14 AM, shashikant.kulka...@gmail.com wrote: > Hi All, > > I am trying to use the JavaRDD.pipe() API. > > I have one object with me from the JavaRDD

Re: Apache Spark JavaRDD pipe() need help

2016-09-22 Thread Jakob Odersky
.pipe() > API. If there is any other way let me know. This code will be executed in > all the nodes in a cluster. > > Hope my requirement is now clear. How to do this? > > Regards, > Shash > > On Thu, Sep 22, 2016 at 4:13 AM, Jakob Odersky <ja...@odersky.com> wrote: >> &

Re: Task Deserialization Error

2016-09-21 Thread Jakob Odersky
Your app is fine, I think the error has to do with the way IntelliJ launches applications. Is your app forked in a new jvm when you run it? On Wed, Sep 21, 2016 at 2:28 PM, Gokula Krishnan D wrote: > Hello Sumit - > > I could see that SparkConf() specification is not being

Re: Has anyone installed the scala kernel for Jupyter notebook

2016-09-21 Thread Jakob Odersky
One option would be to use Apache Toree. A quick setup guide can be found here https://toree.incubator.apache.org/documentation/user/quick-start On Wed, Sep 21, 2016 at 2:02 PM, Arif,Mubaraka wrote: > Has anyone installed the scala kernel for Jupyter notebook. > > > > Any

Re: get different results when debugging and running scala program

2016-09-30 Thread Jakob Odersky
There is no image attached, I'm not sure how the apache mailing lists handle them. Can you provide the output as text? best, --Jakob On Fri, Sep 30, 2016 at 8:25 AM, chen yong wrote: > Hello All, > > > > I am using IDEA 15.0.4 to debug a scala program. It is strange to me

Re: ClassCastException while running a simple wordCount

2016-10-10 Thread Jakob Odersky
How do you submit the application? A version mismatch between the launcher, driver and workers could lead to the bug you're seeing. A common reason for a mismatch is if the SPARK_HOME environment variable is set. This will cause the spark-submit script to use the launcher determined by that

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-19 Thread Jakob Odersky
Another reason I could imagine is that files are often read from HDFS, which by default uses line terminators to separate records. It is possible to implement your own hdfs delimiter finder, however for arbitrary json data, finding that delimiter would require stateful parsing of the file and

Re: [Spark 2] BigDecimal and 0

2016-10-24 Thread Jakob Odersky
se I > need to go digging further then. Thanks for the quick help. > > On Mon, Oct 24, 2016 at 7:34 PM Jakob Odersky <ja...@odersky.com> wrote: >> >> What you're seeing is merely a strange representation, 0E-18 is zero. >> The E-18 represents the precision

Re: [Spark 2] BigDecimal and 0

2016-10-24 Thread Jakob Odersky
What you're seeing is merely a strange representation, 0E-18 is zero. The E-18 represents the precision that Spark uses to store the decimal On Mon, Oct 24, 2016 at 7:32 PM, Jakob Odersky <ja...@odersky.com> wrote: > An even smaller example that demonstrates the same behaviour: > &g

Re: [Spark 2] BigDecimal and 0

2016-10-24 Thread Jakob Odersky
An even smaller example that demonstrates the same behaviour: Seq(Data(BigDecimal(0))).toDS.head On Mon, Oct 24, 2016 at 7:03 PM, Efe Selcuk wrote: > I’m trying to track down what seems to be a very slight imprecision in our > Spark application; two of our columns, which

Re: SparkILoop doesn't run

2016-11-21 Thread Jakob Odersky
there are libraries of multiple scala versions on the same classpath. You mention that it worked before, can you recall what libraries you upgraded before it broke? --Jakob On Mon, Nov 21, 2016 at 2:34 PM, Jakob Odersky <ja...@odersky.com> wrote: > Trying it out locally gave me an NPE.

Re: SparkILoop doesn't run

2016-11-21 Thread Jakob Odersky
Trying it out locally gave me an NPE. I'll look into it in more detail, however the SparkILoop.run() method is dead code. It's used nowhere in spark and can be removed without any issues. On Thu, Nov 17, 2016 at 11:16 AM, Mohit Jaggi wrote: > Thanks Holden. I did post to

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread Jakob Odersky
> how do I tell my spark driver program to not create so many? This may depend on your driver program. Do you spawn any threads in it? Could you share some more information on the driver program, spark version and your environment? It would greatly help others to help you On Mon, Oct 31, 2016

Re: ClassCastException while running a simple wordCount

2016-10-10 Thread Jakob Odersky
Just thought of another potential issue: you should use the "provided" scope when depending on spark. I.e. in your project's pom: <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.11</artifactId> <version>2.0.1</version> <scope>provided</scope> </dependency> On Mon, Oct 10, 2016 at 2:00 PM, Jakob Odersky <ja..

Re: wholeTextFiles()

2016-12-12 Thread Jakob Odersky
Also, in case the issue was not due to the string length (however it is still valid and may get you later), the issue may be due to some other indexing issues which are currently being worked on here https://issues.apache.org/jira/browse/SPARK-6235 On Mon, Dec 12, 2016 at 8:18 PM, Jakob Odersky

Re: wholeTextFiles()

2016-12-12 Thread Jakob Odersky
Hi Pradeep, I'm afraid you're running into a hard Java issue. Strings are indexed with signed integers and can therefore not be longer than approximately 2 billion characters. Could you use `textFile` as a workaround? It will give you an RDD of the files' lines instead. In general, this guide

Re: Optimization for Processing a million of HTML files

2016-12-12 Thread Jakob Odersky
Assuming the bottleneck is IO, you could try saving your files to HDFS. This will distribute your data and allow for better concurrent reads. On Mon, Dec 12, 2016 at 3:06 PM, Reth RM wrote: > Hi, > > I have millions of html files in a directory, using "wholeTextFiles" api

Re: Third party library

2016-12-13 Thread Jakob Odersky
Hi Vineet, great to see you solved the problem! Since this just appeared in my inbox, I wanted to take the opportunity for a shameless plug: https://github.com/jodersky/sbt-jni. In case you're using sbt and also developing the native library, this plugin may help with the pains of building and

Re: custom generate spark application id

2016-12-05 Thread Jakob Odersky
The app ID is assigned internally by spark's task scheduler https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala#L35. You could probably change the naming, however I'm pretty sure that the ID will always have to be unique for a context on a