Re: Getting "java.lang.IllegalArgumentException" when testing a Spark ML pipeline with DecisionTreeClassifier

2015-09-09 Thread Terry Hole
Sean, Thank you! Finally, I got this to work, although it is a bit ugly: manually setting the metadata of the DataFrame. import org.apache.spark.ml.attribute._ import org.apache.spark.sql.types._ val df = training.toDF() val schema = df.schema val rowRDD = df.rdd def enrich(m : Metadata) : Metadata
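
A minimal sketch of this kind of metadata enrichment, assuming a hypothetical binary label column and an added "indexedLabel" column name (the class values are placeholders):

    import org.apache.spark.ml.attribute.NominalAttribute

    // Attach nominal (categorical) metadata to the label column so that
    // DecisionTreeClassifier knows how many classes it has.
    val labelMeta = NominalAttribute.defaultAttr
      .withName("indexedLabel")
      .withValues(Array("0.0", "1.0"))   // hypothetical class values
      .toMetadata()
    val dfWithMeta = df.withColumn("indexedLabel", df("label").as("indexedLabel", labelMeta))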

Re: Contribution in Apache Spark

2015-09-09 Thread Akhil Das
Have a look at https://issues.apache.org/jira/browse/spark/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel Thanks Best Regards On Wed, Sep 9, 2015 at 9:50 AM, Chintan Bhatt < chintanbhatt...@charusat.ac.in> wrote: > I want to contribute to Apache Spark, especially in MLlib, in

Re: hadoop2.6.0 + spark1.4.1 + python2.7.10

2015-09-09 Thread Ashish Dutt
Dear Sasha, What I did was that I installed the parcels on all the nodes of the cluster. Typically the location was /opt/cloudera/parcels/CDH5.4.2-1.cdh5.4.2.p0.2 Hope this helps you. With regards, Ashish On Tue, Sep 8, 2015 at 10:18 PM, Sasha Kacanski wrote: > Hi

Task serialization error for mllib.MovieLensALS

2015-09-09 Thread Jeff Zhang
I ran MovieLensALS but met the following error. The weird thing is that this issue only appears under OpenJDK, and this is based on 1.5. I found several related tickets; not sure whether anyone else has met the same issue and knows the solution? Thanks

Re: foreachRDD causing executor lost failure

2015-09-09 Thread Akhil Das
If you look a bit in the executor logs, you will see the exact reason (mostly an OOM/GC issue etc.). Instead of using foreach, try mapPartitions or foreachPartition. Thanks Best Regards On Tue, Sep 8, 2015 at 10:45 PM, Priya Ch wrote: > Hello All, > > I am
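
A minimal sketch of the foreachPartition pattern, assuming hypothetical createConnection()/send() helpers for whatever sink is being written to:

    // Amortize per-record overhead (e.g. connections) across a whole partition
    rdd.foreachPartition { records =>
      val conn = createConnection()                      // hypothetical helper, one per partition
      try {
        records.foreach(record => conn.send(record))     // hypothetical sink call
      } finally {
        conn.close()
      }
    }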

Re: java.lang.NoSuchMethodError and yarn-client mode

2015-09-09 Thread Aniket Bhatnagar
Hi Tom, There has to be a difference in classpaths between yarn-client and yarn-cluster mode. Perhaps a good starting point would be to print the classpath as the first thing in SimpleApp.main. It should give clues as to why it works in yarn-cluster mode. Thanks, Aniket On Wed, Sep 9, 2015, 2:11 PM Tom
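
A sketch of what that classpath dump could look like at the top of main (the application name is taken from the thread; the rest is illustrative):

    object SimpleApp {
      def main(args: Array[String]): Unit = {
        // Print the URLs the driver's classloader actually sees
        Thread.currentThread().getContextClassLoader match {
          case cl: java.net.URLClassLoader => cl.getURLs.foreach(url => println(url.getFile))
          case other                       => println(s"Classloader: $other")
        }
        // ... rest of the application ...
      }
    }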

Re: Partitions with zero records & variable task times

2015-09-09 Thread Akhil Das
This post here has a bit information http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/ Thanks Best Regards On Wed, Sep 9, 2015 at 6:44 AM, mark wrote: > As I understand things (maybe naively), my

Re: Spark ANN

2015-09-09 Thread Feynman Liang
My 2 cents: * There is frequency domain processing available already (e.g. spark.ml DCT transformer) but no FFT transformer yet because complex numbers are not currently a Spark SQL datatype * We shouldn't assume signals are even, so we need complex numbers to implement the FFT * I have not

Re: Best way to import data from Oracle to Spark?

2015-09-09 Thread Reynold Xin
Using the JDBC data source is probably the best way. http://spark.apache.org/docs/1.4.1/sql-programming-guide.html#jdbc-to-other-databases On Tue, Sep 8, 2015 at 10:11 AM, Cui Lin wrote: > What's the best way to import data from Oracle to Spark? Thanks! > > > -- > Best
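
A sketch of the JDBC data source read in Spark 1.4 syntax; the connection string, table, and driver values below are placeholders:

    val oracleDF = sqlContext.read
      .format("jdbc")
      .options(Map(
        "url"     -> "jdbc:oracle:thin:@//dbhost:1521/SERVICE",   // placeholder connection string
        "dbtable" -> "MYSCHEMA.MY_TABLE",
        "driver"  -> "oracle.jdbc.OracleDriver"))
      .load()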

java.lang.NoSuchMethodError and yarn-client mode

2015-09-09 Thread Tom Seddon
Hi, I have a problem trying to get a fairly simple app working which makes use of native avro libraries. The app runs fine on my local machine and in yarn-cluster mode, but when I try to run it on EMR yarn-client mode I get the error below. I'm aware this is a version problem, as EMR runs an

Re: How to read compressed parquet file

2015-09-09 Thread Cheng Lian
You need to use "har://" instead of "hdfs://" to read HAR files. Just tested against Spark 1.5, and it works as expected. Cheng On 9/9/15 3:29 PM, 李铖 wrote: I think too many parquet files may affect reading capability, so I use hadoop archive to combine them, but
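
A one-line sketch of that read, assuming a hypothetical archive path and directory layout inside it:

    val df = sqlContext.read.parquet("har:///user/me/archives/parquet.har/year=2015")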

How to read compressed parquet file

2015-09-09 Thread 李铖
I think too many parquet files may affect reading capability, so I use hadoop archive to combine them, but sql_context.read.parquet(output_path) does not work on the file. How can I fix it? Please help me. :)

Re: No auto decompress in Spark Java textFile function?

2015-09-09 Thread Akhil Das
textFile used to work with .gz files; I haven't tested it on bz2 files. If it isn't decompressing by default, then what you have to do is use sc.wholeTextFiles and then decompress each record (that being a file) with the corresponding codec. Thanks Best Regards On Tue, Sep 8, 2015 at 6:49
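
A sketch of manual decompression, using sc.binaryFiles rather than wholeTextFiles (an assumption on my part, since gzip needs the raw bytes); the local glob path is a placeholder:

    import java.util.zip.GZIPInputStream
    import scala.io.Source

    val texts = sc.binaryFiles("file:///data/logs/*.gz").map { case (path, stream) =>
      val in = new GZIPInputStream(stream.open())
      try {
        (path, Source.fromInputStream(in).mkString)   // (file name, decompressed contents)
      } finally {
        in.close()
      }
    }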

[ANNOUNCE] Announcing Spark 1.5.0

2015-09-09 Thread Reynold Xin
Hi All, Spark 1.5.0 is the sixth release on the 1.x line. This release represents 1400+ patches from 230+ contributors and 80+ institutions. To download Spark 1.5.0 visit the downloads page. A huge thanks goes to all of the individuals and organizations involved in the development and testing of this

I am very new to Spark. I have a very basic question. I have an array of values: listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A) I want to filter an RDD for all of

2015-09-09 Thread prachicsa
I am very new to Spark. I have a very basic question. I have an array of values: listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A) I want to filter an RDD for all of these token values. I tried the following way: val ECtokens = for (token <- listofECtokens)

Re: Spark summit Asia

2015-09-09 Thread mark
http://www.stratahadoopworld.com/singapore/index.html On 8 Sep 2015 8:35 am, "Kevin Jung" wrote: > Is there any plan to hold Spark summit in Asia? > I'm very much looking forward to it. > > Kevin > > > > -- > View this message in context: >

Filtering records for all values of an array in Spark

2015-09-09 Thread prachicsa
I am very new to Spark. I have a very basic question. I have an array of values: listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A) I want to filter an RDD for all of these token values. I tried the following way: val ECtokens = for (token <- listofECtokens)

Re: How to read files from S3 from Spark local when there is a http proxy

2015-09-09 Thread Steve Loughran
s3a:// has a proxy option https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html s3n: apparently gets set up differently, though I've never tested it http://stackoverflow.com/questions/20241953/hadoop-distcp-to-s3-behind-http-proxy > On 8 Sep 2015, at 13:51, tariq
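
A sketch of the s3a proxy settings set programmatically on the Hadoop configuration (host, port, and credentials below are placeholders):

    sc.hadoopConfiguration.set("fs.s3a.proxy.host", "proxy.example.com")
    sc.hadoopConfiguration.set("fs.s3a.proxy.port", "8080")
    sc.hadoopConfiguration.set("fs.s3a.proxy.username", "user")     // only if the proxy needs auth
    sc.hadoopConfiguration.set("fs.s3a.proxy.password", "secret")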

Help getting Spark JDBC metadata

2015-09-09 Thread Tom Barber
Hi guys, Hopefully someone can help me, or at least explain things to me. I use a tool that requires JDBC metadata (tables/columns etc.), so using Spark 1.3.1 I try things like registerTempTable() or saveAsTable() on my parquet file. The former doesn't show any table metadata for JDBC

Re: java.lang.NoSuchMethodError and yarn-client mode

2015-09-09 Thread Ted Yu
Have you checked the contents of __app__.jar ? > On Sep 9, 2015, at 3:28 AM, Tom Seddon wrote: > > Thanks for your reply Aniket. > > Ok I've done this and I'm still confused. Output from running locally shows: > >

RE: Support of other languages?

2015-09-09 Thread Sun, Rui
Hi, Rahul, To support a new language other than Java/Scala in Spark, the approach differs between the RDD API and the DataFrame API. For the RDD API: an RDD is a distributed collection of the language-specific data types whose representation is unknown to the JVM. Also, transformation functions for the RDD are written

bad substitution for [hdp.version] Error in spark on YARN job

2015-09-09 Thread Jeetendra Gangele
Hi , I am getting below error when running the spark job on YARN with HDP cluster. I have installed spark and yarn from Ambari and I am using spark 1.3.1 with HDP version HDP-2.3.0.0-2557. My spark-default.conf has correct entry spark.driver.extraJavaOptions -Dhdp.version=2.3.0.0-2557

Re: I am very new to Spark. I have a very basic question. I have an array of values: listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A) I want to filter an RDD for all o

2015-09-09 Thread Ted Yu
Prachicsa: If the number of EC tokens is high, please consider using a set instead of an array for better lookup performance. BTW use short, descriptive subject for future emails. > On Sep 9, 2015, at 3:13 AM, Akhil Das wrote: > > Try this: > > val tocks =
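
A sketch of the set-based variant, broadcast so every executor reuses the same lookup structure (token values taken from the thread, the broadcast step is my addition):

    val tokens = Set("EC-17A5206955089011B", "EC-17A5206955089011A")
    val tokensBc = sc.broadcast(tokens)
    // keep lines that contain any of the tokens
    val filtered = rddAll.filter(line => tokensBc.value.exists(t => line.contains(t)))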

Re: Why is huge data shuffling in Spark when using union()/coalesce(1,false) on DataFrame?

2015-09-09 Thread Umesh Kacha
Hi Richard, thanks for the response. My use case is weird: I need to process data row by row for one partition and update the required rows. The updated rows percentage would be 30%. As per the above stackoverflow.com answer's suggestions I refactored the code to use mapPartitionsWithIndex: JavaRDD indexedRdd =

I want to know the parition result in each node

2015-09-09 Thread szy
I am doing scientific research about graph computing, and I decided to use Spark GraphX to do some experiments. Now I have a question about partitioning: how can I get each partition on each worker in my Spark program? For example, if a graph includes vertices 1,2,3,4 and I have two workers. After

Re: Applying transformations on a JavaRDD using reflection

2015-09-09 Thread Robin East
Have you got some code already that demonstrates the problem? > On 9 Sep 2015, at 04:45, Nirmal Fernando wrote: > > Any thoughts? > > On Tue, Sep 8, 2015 at 3:37 PM, Nirmal Fernando > wrote: > Hi All, > > I'd like to apply a chain of

Re: I am very new to Spark. I have a very basic question. I have an array of values: listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A) I want to filter an RDD for all o

2015-09-09 Thread Akhil Das
Try this: val tocks = Array("EC-17A5206955089011B","EC-17A5206955089011A") val rddAll = sc.parallelize(List("This contains EC-17A5206955089011B","This doesnt")) rddAll.filter(line => { var found = false for(item <- tocks){ if(line.contains(item)) found = true } found

Re: How to read compressed parquet file

2015-09-09 Thread 李铖
It works on Spark 1.4. Thanks a lot. 2015-09-09 17:21 GMT+08:00 Cheng Lian : > You need to use "har://" instead of "hdfs://" to read HAR files. Just > tested against Spark 1.5, and it works as expected. > > Cheng > > > On 9/9/15 3:29 PM, 李铖 wrote: > > I think too many

Re: java.lang.NoSuchMethodError and yarn-client mode

2015-09-09 Thread Tom Seddon
Thanks for your reply Aniket. Ok I've done this and I'm still confused. Output from running locally shows: file:/home/tom/spark-avro/target/scala-2.10/simpleapp.jar file:/home/tom/spark-1.4.0-bin-hadoop2.4/conf/ file:/home/tom/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar

Re: Partitioning a RDD for training multiple classifiers

2015-09-09 Thread Maximo Gurmendez
Thanks Ben for your answer. I'll explore what happens under the hood in a DataFrame. Regarding the ability to split a large RDD into n RDDs without requiring n passes over the large RDD: can partitionBy() help? If I partition by a key that corresponds to the split criteria (i.e. client

Re: long running Spark Streaming job and eventlog files

2015-09-09 Thread jarod7736
I believe I just added more disk space, and wrote a script to clean up the files after a few days. Not really a good resolution, but it works. > jamespowenjr [via Apache Spark User List] > > September 8, 2015 at 12:55 PM > Jarod, > I'm having

[Spark on Amazon EMR] : File does not exist: hdfs://ip-x-x-x-x:/.../spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar

2015-09-09 Thread shahab
Hi, I am using Spark on Amazon EMR. So far I have not succeeded in submitting the application successfully, and I am not sure what the problem is. In the log file I see the following. java.io.FileNotFoundException: File does not exist:

Re: Event logging not working when worker machine terminated

2015-09-09 Thread David Rosenstrauch
Standalone. On 09/08/2015 11:18 PM, Jeff Zhang wrote: What cluster mode do you use ? Standalone/Yarn/Mesos ? On Wed, Sep 9, 2015 at 11:15 AM, David Rosenstrauch wrote: Our Spark cluster is configured to write application history event logging to a directory on HDFS.

Re: No auto decompress in Spark Java textFile function?

2015-09-09 Thread Chris Teoh
Thanks. What I noticed was that the decompression works if the file is in HDFS but not when it is a local file when working in a development environment. Does anyone else have the same problem? On Wed, 9 Sep 2015 at 4:40 pm Akhil Das wrote: > textFile used to work with .gz

Re: [streaming] DStream with window performance issue

2015-09-09 Thread Cody Koeninger
It looked like from your graphs that you had a 10 second batch time, but that your processing time was consistently 11 seconds. If that's correct, then yes your delay is going to keep growing. You'd need to either increase your batch time, or get your processing time down (either by adding more

JNI issues with mesos

2015-09-09 Thread Adrian Bridgett
I'm trying to run Spark (1.4.1) on top of Mesos (0.23). I've followed the instructions (uploaded the Spark tarball to HDFS, set the executor URI in both places etc.) and yet on the slaves it's failing to launch even the SparkPi example, with a JNI error. It does run with a local master. A day of

spark.kryo.registrationRequired: Tuple2 is not registered

2015-09-09 Thread Marius Soutier
Hi all, as indicated in the title, I’m using Kryo with a custom Kryo serializer, but as soon as I enable `spark.kryo.registrationRequired`, my Spark Streaming job fails to start with this exception: Class is not registered: scala.collection.immutable.Range When I register it, it continues
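
A sketch of registering the classes named by that kind of exception; the second entry is illustrative of array types that often come up as well:

    import org.apache.spark.SparkConf

    // With registrationRequired enabled, every class that gets serialized must be listed
    // explicitly, including Scala library classes from the exception message.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(
        classOf[scala.collection.immutable.Range],
        classOf[Array[String]]))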

spark history server + yarn log aggregation issue

2015-09-09 Thread michael.england
Hi, I am running Spark-on-YARN on a secure cluster with yarn log aggregation set up. Once a job completes, when viewing stdout/stderr executor logs in the Spark history server UI it redirects me to the local nodemanager where a page appears for a second saying ‘Redirecting to log server….’ and

Re: Java vs. Scala for Spark

2015-09-09 Thread Cody Koeninger
Java 8 lambdas are broken to the point of near-uselessness (because of checked exceptions and inability to close over non-final references). I wouldn't use them as a deciding factor in language choice. Any competent developer should be able to write reasonable java-in-scala after a week and

Re: [streaming] DStream with window performance issue

2015-09-09 Thread Понькин Алексей
That's correct, I have a 10 second batch. The problem is actually in processing time; it is increasing constantly no matter how small or large my window duration is. I am trying to prepare some example code to clarify my use case. -- Yandex.Mail: reliable mail

Loading json data into Pair RDD in Spark using java

2015-09-09 Thread prachicsa
I am very new to Spark. I have a very basic question. I read a file into a Spark RDD in which each line is JSON. I want to apply groupBy-like transformations, so I want to transform each JSON line into a PairRDD. Is there a straightforward way to do it in Java? My json is like this: {
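
A sketch in Scala of one way to key line-delimited JSON by a field; the file path and the "token" field name are placeholders for whatever the actual JSON contains:

    val df = sqlContext.read.json("sample.json")                      // one JSON object per line
    val pairs = df.rdd.map(row => (row.getAs[String]("token"), row))  // (key, full row)
    val grouped = pairs.groupByKey()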

Re: Event logging not working when worker machine terminated

2015-09-09 Thread Charles Chao
I have encountered the same problem after migrating from 1.2.2 to 1.3.0. After some searching this appears to be a bug introduced in 1.3. Hopefully it's fixed in 1.4. Thanks, Charles On 9/9/15, 7:30 AM, "David Rosenstrauch" wrote: >Standalone. > >On 09/08/2015 11:18

Re: Event logging not working when worker machine terminated

2015-09-09 Thread David Rosenstrauch
Thanks for the info. Do you know if there's a ticket already open for this issue? If so, I'd like to monitor it. Thanks, DR On 09/09/2015 11:50 AM, Charles Chao wrote: I have encountered the same problem after migrating from 1.2.2 to 1.3.0. After some searching this appears to be a bug

Re: Loading json data into Pair RDD in Spark using java

2015-09-09 Thread Ted Yu
Please take a look at the example in SPARK-10287 FYI On Wed, Sep 9, 2015 at 8:50 AM, prachicsa wrote: > > > I am very new to Spark. > > I have a very basic question. I read a file in Spark RDD in which each line > is a JSON. I want to make apply groupBy like

Re: JNI issues with mesos

2015-09-09 Thread Adrian Bridgett
5mins later... Trying 1.5 with a fairly plain build: ./make-distribution.sh --tgz --name os1 -Phadoop-2.6 and on my first attempt stderr showed: I0909 15:16:49.392144 1619 fetcher.cpp:441] Fetched 'hdfs:///apps/spark/spark15.tgz' to

Re: New to Spark - Paritioning Question

2015-09-09 Thread Richard Marscher
Ah I see. In that case, the groupByKey function does guarantee every key is on exactly one partition matched with the aggregated data. This can be improved depending on what you want to do after. Group by key only aggregates the data after shipping it across the cluster. Meanwhile, using

Re: Event logging not working when worker machine terminated

2015-09-09 Thread Charles Chao
Fixed in 1.3.1 https://issues.apache.org/jira/browse/SPARK-6950 Thanks, Charles On 9/9/15, 8:54 AM, "David Rosenstrauch" wrote: >Thanks for the info. Do you know if there's a ticket already open for >this issue? If so, I'd like to monitor it. > >Thanks, > >DR > >On

Re: What should be the optimal value for spark.sql.shuffle.partition?

2015-09-09 Thread Richard Marscher
I see you reposted with more details: "I have 2 TB of skewed data to process and then convert rdd into dataframe and use it as table in hiveContext.sql(). I am using 60 executors with 20 GB memory and 4 cores." If I'm reading that correctly, you have 2TB of data and 1.2TB of memory in the

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-09-09 Thread Kristina Rogale Plazonic
Thanks so much! I was just looking for it today on spark-packages - you've read my mind :) On Wed, Sep 9, 2015 at 5:53 PM, Burak Yavuz wrote: > By the way, I published > http://spark-packages.org/package/brkyvz/lazy-linalg that contains many > of the arithmetic operations for

Re: Partitions with zero records & variable task times

2015-09-09 Thread mark
The article is interesting but doesn't really help. It has only one sentence about data distribution in partitions. How can I diagnose skewed data distribution? How could evenly sized blocks in HDFS lead to skewed data anyway? On 9 Sep 2015 2:29 pm, "Akhil Das"
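
A sketch of one way to diagnose skew: count the records in every partition of the RDD in question (here just called rdd) and print the distribution:

    val partitionSizes = rdd.mapPartitionsWithIndex { (idx, iter) =>
      Iterator((idx, iter.size))          // one (partition index, record count) pair per partition
    }.collect()
    partitionSizes.sortBy(_._1).foreach { case (idx, n) => println(s"partition $idx: $n records") }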

Re: ArrayIndexOutOfBoundsException when using repartitionAndSortWithinPartitions()

2015-09-09 Thread Ted Yu
Which release of Spark are you using ? Can you show skeleton of your partitioner and comparator ? Thanks > On Sep 9, 2015, at 4:45 PM, Ashish Shenoy wrote: > > Hi, > > I am trying to sort a RDD pair using repartitionAndSortWithinPartitions() for > my key [which

Re: spark.shuffle.spill=false ignored?

2015-09-09 Thread Eric Walker
Hi Richard, I've stepped away from this issue since I raised my question. An additional detail, unknown at the time, is that the application did not run out of disk space every time spilling to disk was encountered; that appears to have been a one-off problem. The

Re: Spark MLlib Decision Tree Node Accuracy

2015-09-09 Thread sethah
If you are able to traverse the tree, then you can extract the id of the leaf node for each feature vector. This is like a modified predict method where it returns the leaf node assigned to the data point instead of the prediction for that leaf node. The following example code should work:
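
A sketch along the lines described, using the old mllib tree model API, not the original poster's code: walk from the root to the leaf reached by a feature vector and return that leaf's node id.

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.tree.configuration.FeatureType
    import org.apache.spark.mllib.tree.model.Node

    def leafId(node: Node, features: Vector): Int = {
      if (node.isLeaf) {
        node.id
      } else {
        val split = node.split.get
        val goLeft =
          if (split.featureType == FeatureType.Continuous) features(split.feature) <= split.threshold
          else split.categories.contains(features(split.feature))
        if (goLeft) leafId(node.leftNode.get, features) else leafId(node.rightNode.get, features)
      }
    }
    // usage: leafId(model.topNode, featureVector)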

Accumulator with non-java-serializable value ?

2015-09-09 Thread Thomas Dudziak
I want to use t-digest with foreachPartition and accumulators (essentially, create a t-digest per partition and add that to the accumulator leveraging the fact that t-digests can be added to each other). I can make t-digests kryo-serializable easily but java-serializable is not very easy. Now,

Re: Does Spark.ml LogisticRegression assumes only Double valued features?

2015-09-09 Thread sethah
When you pass a data frame into the train method of LogisticRegression and other ML learning algorithms, the data is extracted by using parameters `labelCol` and `featuresCol` which should have been set before calling the train method (they default to "label" and "features", respectively).
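
A sketch of setting those parameters explicitly; the column names and trainingDF are placeholders:

    import org.apache.spark.ml.classification.LogisticRegression

    val lr = new LogisticRegression()
      .setLabelCol("myLabel")         // DoubleType column
      .setFeaturesCol("myFeatures")   // Vector column, e.g. produced by VectorAssembler
    val model = lr.fit(trainingDF)    // trainingDF is assumed to be an existing DataFrame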

ArrayIndexOutOfBoundsException when using repartitionAndSortWithinPartitions()

2015-09-09 Thread Ashish Shenoy
Hi, I am trying to sort a RDD pair using repartitionAndSortWithinPartitions() for my key [which is a custom class, not a java primitive] using a custom partitioner on that key and a custom comparator. However, it fails consistently: org.apache.spark.SparkException: Job aborted due to stage
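
A sketch of a custom partitioner skeleton, not the poster's actual code: a common cause of ArrayIndexOutOfBoundsException here is getPartition returning a negative index when the key's hashCode is negative.

    import org.apache.spark.Partitioner

    class MyKeyPartitioner(override val numPartitions: Int) extends Partitioner {
      override def getPartition(key: Any): Int = {
        val mod = key.hashCode % numPartitions
        if (mod < 0) mod + numPartitions else mod   // keep the result in [0, numPartitions)
      }
    }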

RE: Spark ANN

2015-09-09 Thread Ulanov, Alexander
Thank you, Feynman, this is helpful. The paper that I linked claims a big speedup with FFTs for large convolution size. Though as you mentioned there is no FFT transformer in Spark yet. Moreover, we would need a parallel version of FFTs to support batch computations. So it probably worth

Re: Can not allocate executor when running spark on mesos

2015-09-09 Thread canan chen
Finally got the answer. Actually it works fine. The allocation behavior on Mesos is a little different from YARN/standalone: executors on Mesos seem to be lazily allocated (only when a job is executed) while executors on YARN/standalone are allocated when spark-shell is started. On Tue, Sep 8,

Re: Is HDFS required for Spark streaming?

2015-09-09 Thread N B
Thanks Cody and TD. If we do run with local directories, I suppose the checkpoint operation will write the various partitions of an RDD into their own local dirs (of course). So what's the worst that can happen in case of a node failure? Will the streaming batches continue to process (i.e. does

Tungsten and Spark Streaming

2015-09-09 Thread N B
Hello, How can we start taking advantage of the performance gains made under Project Tungsten in Spark 1.5 for a Spark Streaming program? From what I understand, this is available by default for DataFrames. But for a program written using Spark Streaming, would we see any potential gains "out

Creating Parquet external table using HiveContext API

2015-09-09 Thread Mohammad Islam
Hi, I want to create an external Hive table using HiveContext. I have the following: 1. the full path/location of the parquet data directory, 2. the name of the new table, and 3. I can get the schema as well. What API will be the best (for 1.3.x or 1.4.x)? I can see 6 createExternalTable() APIs but am not sure which
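
A sketch using one of the createExternalTable overloads on HiveContext (there is also an overload that takes an explicit StructType schema); table name and path below are placeholders:

    hiveContext.createExternalTable(
      "my_table",                               // name of the new table
      "parquet",                                // data source
      Map("path" -> "/data/warehouse/my_table"))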

Re: build on spark 1.5.0 error with Execution scala-compile-first of goal & Compile failed via zinc server

2015-09-09 Thread Ted Yu
I used your first command with mvn 3.3.3 (without build/) The build passed. FYI On Wed, Sep 9, 2015 at 8:50 PM, stark_summer wrote: > codeurl: http://d3kbcqa49mib13.cloudfront.net/spark-1.5.0.tgz > build scripts: > > build/mvn -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.1.0

Re: Failed when starting Spark 1.5.0 standalone cluster

2015-09-09 Thread Ted Yu
See the following announcement: http://search-hadoop.com/m/q3RTtojAyW1dabFk On Wed, Sep 9, 2015 at 9:05 PM, Netwaver wrote: > Hi Spark experts, > I am trying to migrate my Spark cluster from > 1.4.1 to latest 1.5.0 , but meet below issues when run

spark streaming 1.3 with kafka connection timeout

2015-09-09 Thread Shushant Arora
Executors in Spark Streaming 1.3 fetch messages from Kafka in batches. What happens when an executor takes a longer time to complete a fetch batch, say in directKafkaStream.foreachRDD(new Function, Void>() { @Override public Void call(JavaRDD v1) throws Exception {

RE: Driver OOM after upgrading to 1.5

2015-09-09 Thread Cheng, Hao
Would it help to add JVM options like: -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, September 10, 2015 5:31 AM To: Sandy Ryza Cc: user@spark.apache.org Subject: Re: Driver OOM after upgrading to 1.5 It's

Filtering an rdd depending upon a list of values in Spark

2015-09-09 Thread prachicsa
I want to apply filter based on a list of values in Spark. This is how I get the list: DataFrame df = sqlContext.read().json("../sample.json"); df.groupBy("token").count().show(); Tokens = df.select("token").collect(); for(int i = 0; i < Tokens.length; i++){

Re: Filtering an rdd depending upon a list of values in Spark

2015-09-09 Thread Ted Yu
Take a look at the following methods: * Filters rows using the given condition. * {{{ * // The following are equivalent: * peopleDf.filter($"age" > 15) * peopleDf.where($"age" > 15) * }}} * @group dfops * @since 1.3.0 */ def filter(condition: Column): DataFrame
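
A sketch of that approach, collecting the tokens once and filtering with Column.isin (available in Spark 1.5); the "token" column name comes from the thread:

    val tokens = df.select("token").collect().map(_.getString(0))
    val filtered = df.filter(df("token").isin(tokens: _*))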

build on spark 1.5.0 error with Execution scala-compile-first of goal & Compile failed via zinc server

2015-09-09 Thread stark_summer
codeurl: http://d3kbcqa49mib13.cloudfront.net/spark-1.5.0.tgz build scripts: build/mvn -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.1.0 -Pyarn -Phive -Dhive-version=0.12.0-cdh5.1.0 -Phive-thriftserver -Pspark-ganglia-lgpl -DskipTests clean package or ./make-distribution.sh --name 2.3.0 --tgz

Failed when starting Spark 1.5.0 standalone cluster

2015-09-09 Thread Netwaver
Hi Spark experts, I am trying to migrate my Spark cluster from 1.4.1 to the latest 1.5.0, but I hit the below issue when running the start-all.sh script. Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/launcher/Main Caused by:

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-09 Thread Tathagata Das
The whole point of checkpointing is to recover the *exact* computation where it left off. If you want any change in the specification of the computation (which includes any intervals), then you cannot recover from a checkpoint, as it can be an arbitrarily complex issue to deal with changes in the

Re:Re: Failed when starting Spark 1.5.0 standalone cluster

2015-09-09 Thread Netwaver
Thank you, Ted, this does help. One more question: if I just want to change the JDK only for Spark on my cluster machines, where can I add the JAVA_HOME environment variable? Does conf/spark-env.sh support the JAVA_HOME environment variable? Thanks a lot. On 2015-09-10 12:45:43, "Ted Yu"

RE: Spark streaming -> cassandra : Fault Tolerance

2015-09-09 Thread Samya MAITI
Hi Cody, Thanks for your reply. Is there a way in the Spark Kafka direct API so that, if an exception occurs while writing to Cassandra, we stop updating the checkpoint? That way no message is lost: once Cassandra comes back up, we can start reading from the point we left off. Regards, Sam

Re: Tungsten and Spark Streaming

2015-09-09 Thread Tathagata Das
Rewriting is necessary. You will have to convert RDD/DStream operations to DataFrame operations. So get the RDDs in DStream, using transform/foreachRDD, convert to DataFrames and then do DataFrame operations. On Wed, Sep 9, 2015 at 9:23 PM, N B wrote: > Hello, > > How can
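
A sketch of that conversion, assuming a hypothetical Event record type and "user" column; SQLContext.getOrCreate is used so each batch reuses the same context:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.dstream.DStream

    case class Event(user: String, value: Long)   // hypothetical record type

    def process(events: DStream[Event]): Unit = {
      events.foreachRDD { rdd =>
        val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
        import sqlContext.implicits._
        // DataFrame operations run through the Tungsten/codegen path
        rdd.toDF().groupBy("user").count().show()
      }
    }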

Re: Is HDFS required for Spark streaming?

2015-09-09 Thread Tathagata Das
Actually, I think it won't work. If you are using some operation that requires RDD checkpointing, then if the checkpoint files cannot be read (because an executor failed), any subsequent operations that need that state data cannot continue. So all subsequent batches will fail. You could reduce the

How to keep history of streaming statistics

2015-09-09 Thread b.bhavesh
Hello, How can I keep the history of streaming statistics for completed applications? Where does Spark store the information presented on the UI? Thanks, b.bhavesh -- View this message in context:

SparkR - Support for Other Models

2015-09-09 Thread Manish MAHESHWARI
Hello, Is there a timeline for adding support for other model types like SVD, clustering, GBM, etc. in subsequent releases? 1.5 added support for linear models only. If there is, where can we find the tentative timeline? Thanks, Manish CONFIDENTIAL NOTE: The information contained in

Re: JNI issues with mesos

2015-09-09 Thread Adrian Bridgett
Thanks Tim, There's a little more to it in fact - if I use the pre-built-with-hadoop-2.6 binaries, all is good (with correctly named tarballs in hdfs). Using the pre-built with user-provided hadoop (including setting SPARK_DIST_CLASSPATH in setup-env.sh) then I get the JNI exception. Aha

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-09 Thread Dmitry Goldenberg
>> when you use getOrCreate, and there exists a valid checkpoint, it will always return the context from the checkpoint and not call the factory. Simple way to see whats going on is to print something in the factory to verify whether it is ever called. This is probably OK. Seems to explain why we
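
A sketch of that check, with a print statement in the factory to confirm whether it is ever called (i.e. whether a valid checkpoint was found); the checkpoint path and app name are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "/tmp/streaming-checkpoint"   // placeholder path

    def createContext(): StreamingContext = {
      println("No valid checkpoint found - creating a new StreamingContext")
      val conf = new SparkConf().setAppName("my-streaming-app")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // ... set up DStreams here ...
      ssc
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)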

Spark streaming -> cassandra : Fault Tolerance

2015-09-09 Thread Samya
Hi Team, I have a sample Spark application which reads from Kafka using the direct API & then does some transformation & stores to Cassandra (using saveToCassandra()). If Cassandra goes down, the application logs a NoHostAvailable exception (as expected). But in the meantime the new incoming

Re: Driver OOM after upgrading to 1.5

2015-09-09 Thread Reynold Xin
Java 7 / 8? On Wed, Sep 9, 2015 at 10:10 AM, Sandy Ryza wrote: > I just upgraded the spark-timeseries > project to run on top of > 1.5, and I'm noticing that tests are failing with OOMEs. > > I ran a jmap -histo on the

Re: Spark Streaming checkpoints and code upgrade

2015-09-09 Thread Tathagata Das
It's pretty much impossible to do across arbitrary code changes. For that, the best way forward is to store and load the offsets yourselves. On Wed, Sep 9, 2015 at 10:19 AM, Nicolas Monchy wrote: > Hello, > > I am using Spark Streaming and the Kafka Direct API and I
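
A sketch of capturing the offsets of each batch so they can be persisted to a store of your choosing; the stream variable is assumed to come from createDirectStream, and the println stands in for the actual save:

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch ...
      offsetRanges.foreach { o =>
        println(s"${o.topic} ${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
        // persist (topic, partition, untilOffset) to your own store here
      }
    }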

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-09 Thread Tathagata Das
See inline. On Tue, Sep 8, 2015 at 9:02 PM, Dmitry Goldenberg wrote: > What's wrong with creating a checkpointed context?? We WANT > checkpointing, first of all. We therefore WANT the checkpointed context. > > Second of all, it's not true that we're loading the

Re: Spark Streaming checkpoints and code upgrade

2015-09-09 Thread Nicolas Monchy
Ok. Thank you for the reply. On Wed, Sep 9, 2015 at 11:40 AM, Tathagata Das wrote: > It's pretty much impossible to do across arbitrary code changes. For that, > the best way forward is to store and load the offsets yourselves. > > On Wed, Sep 9, 2015 at 10:19 AM,

Re: performance when checking if data frame is empty or not

2015-09-09 Thread Ted Yu
Have you tried: df.rdd.isEmpty Cheers On Tue, Sep 8, 2015 at 1:22 PM, Axel Dahl wrote: > I have a join, that fails when one of the data frames is empty. > > To avoid this I am hoping to check if the dataframe is empty or not before > the join. > > The question is

Cores per executors

2015-09-09 Thread Thomas Gerber
Hello, I was wondering how Spark was enforcing to use *only* X number of cores per executor. Is it simply running max Y tasks in parallel on each executor where X = Y * spark.task.cpus? (This is what I understood from browsing TaskSchedulerImpl). Which would mean the processing power used

Problems with Local Checkpoints

2015-09-09 Thread Bryan Jeffrey
Hello. I have some basic code that counts numbers using updateStateByKey. I setup a streaming context with checkpointing as follows: def createStreamingContext(masterName : String, checkpointDirectory : String, timeWindow : Int) : StreamingContext = { val sparkConf = new

Spark UI keep redirecting to /null and returns 500

2015-09-09 Thread Rajeev Prasad
Hi All, I am having a problem accessing the Spark UI while running in spark-client mode. It works fine in local mode. It keeps redirecting back to itself by adding /null at the end, and ultimately runs over the size limit for the URL and returns 500. Look at the following below. I have a feeling that I might

Re: JNI issues with mesos

2015-09-09 Thread Tim Chen
Hi Adrian, Spark is expecting a specific naming of the tgz and also the folder name inside, as this is generated by running make-distribution.sh --tgz in the Spark source folder. If you use a Spark 1.4 tgz generated with that script with the same name and upload to HDFS again, fix the URI then

Re: bad substitution for [hdp.version] Error in spark on YARN job

2015-09-09 Thread Jeetendra Gangele
Finally it worked out; I solved it by modifying mapred-site.xml and removing the HDP version entries from the YARN application master property. On 9 September 2015 at 17:44, Jeetendra Gangele wrote: > Hi , > I am getting below error when running the

Driver OOM after upgrading to 1.5

2015-09-09 Thread Sandy Ryza
I just upgraded the spark-timeseries project to run on top of 1.5, and I'm noticing that tests are failing with OOMEs. I ran a jmap -histo on the process and discovered the top heap items to be: 1:163428 22236064 2:

Spark Streaming checkpoints and code upgrade

2015-09-09 Thread Nicolas Monchy
Hello, I am using Spark Streaming and the Kafka Direct API and I am checkpointing the metadata. Checkpoints aren't recoverable if you upgrade code so I am losing the last consumed offsets in this case. I know I can build a system to store and load the offsets for each batch but before

Spark rdd.mapPartitionsWithIndex() hits physical memory limit after huge data shuffle

2015-09-09 Thread unk1102
Hi, I have the following Spark code which involves huge data shuffling even though I am using mapPartitionsWithIndex() with shuffle=false. I have 2 TB of skewed data to process and then convert the RDD into a dataframe and use it as a table in hiveContext.sql(). I am using 60 executors with 20 GB memory and 4

Re: Partitioning a RDD for training multiple classifiers

2015-09-09 Thread Maximo Gurmendez
Adding an example (very raw), to see if my understanding is correct: val repartitioned = bidRdd.partitionBy(new Partitioner { def numPartitions: Int = 100 def getPartition(clientId: Any): Int = hash(clientId) % 100 } val cachedRdd = repartitioned.cache() val client1Rdd =
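
A sketch along the same lines, assuming bidRdd is an RDD of (clientId, bid) pairs: one shuffle up front, then per-client RDDs carved out of the cached, partitioned data.

    import org.apache.spark.HashPartitioner

    val repartitioned = bidRdd.partitionBy(new HashPartitioner(100)).cache()
    def clientRdd(clientId: String) =
      repartitioned.filter { case (id, _) => id == clientId }   // scans the cache, no new shuffle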

Re: Spark streaming -> cassandra : Fault Tolerance

2015-09-09 Thread Cody Koeninger
It's been a while since I've looked at the cassandra connector, so I can't give you specific advice on it. But in general, if a spark task fails (uncaught exception), it will be retried automatically. In the case of the kafka direct stream rdd, it will have exactly the same messages as the first

Re: Driver OOM after upgrading to 1.5

2015-09-09 Thread Sandy Ryza
Java 7. FWIW I was just able to get it to work by increasing MaxPermSize to 256m. -Sandy On Wed, Sep 9, 2015 at 11:37 AM, Reynold Xin wrote: > Java 7 / 8? > > On Wed, Sep 9, 2015 at 10:10 AM, Sandy Ryza > wrote: > >> I just upgraded the

Re: [Spark on Amazon EMR] : File does not exist: hdfs://ip-x-x-x-x:/.../spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar

2015-09-09 Thread Dean Wampler
If you log into the cluster, do you see the file if you type: hdfs dfs -ls hdfs://ipx-x-x-x:8020/user/hadoop/.sparkStaging/application_123344567_0018/spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar (with the correct server address for "ipx-x-x-x"). If not, is the server address correct and routable

Re: Driver OOM after upgrading to 1.5

2015-09-09 Thread Reynold Xin
It's likely that with codegen, you need a bigger permgen space. Also I found that Java 7 doesn't do very well w.r.t. flushing code cache. As a result, Spark SQL and DataFrames now run much better under Java 8, because it flushes code cache better. On Wed, Sep 9, 2015 at 2:12 PM, Sandy Ryza

Re: What should be the optimal value for spark.sql.shuffle.partition?

2015-09-09 Thread Richard Marscher
Do you have any details about the cluster you are running this against? The memory per executor/node, number of executors, and such? Even at a shuffle setting of 1000 that would be roughly 1GB per partition assuming the 1TB of data includes overheads in the JVM. Maybe try another order of

Re: spark.shuffle.spill=false ignored?

2015-09-09 Thread Richard Marscher
Hi Eric, I just wanted to do a sanity check, do you know what paths it is trying to write to? I ask because even without spilling, shuffles always write to disk first before transferring data across the network. I had at one point encountered this myself where we accidentally had /tmp mounted on

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-09-09 Thread Burak Yavuz
By the way, I published http://spark-packages.org/package/brkyvz/lazy-linalg that contains many of the arithmetic operations for use in Scala. I really would appreciate any feedback! On Tue, Aug 25, 2015 at 11:06 AM, Kristina Rogale Plazonic wrote: > YES PLEASE! > > :)))
