Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
What Spark version are you using? Also, a small code snippet of how you use Spark Streaming would be very helpful. On Fri, Oct 30, 2015 at 3:57 PM, Ramkumar V wrote: > I am able to read and print a few lines. After that I'm getting this > exception. Any idea why?

Spark 1.5.1 Build Failure

2015-10-30 Thread Raghuveer Chanda
Hi, I am trying to build spark 1.5.1 for hadoop 2.5 but I get the following error. *build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.2 -DskipTests clean package* [INFO] Spark Project Parent POM ... SUCCESS [ 9.812 s] [INFO] Spark Project Launcher

Re: Spark 1.5.1 Build Failure

2015-10-30 Thread Raghuveer Chanda
Thanks for the reply. I am using the mvn and Scala bundled with the source (build/mvn), and I get the same error after clean package even without Hadoop. *Java Version:* *rchanda@ubuntu:~/Downloads/spark-1.5.1$ java -version* *java version "1.7.0_85"* *OpenJDK Runtime Environment (IcedTea

Re: Spark 1.5.1 Build Failure

2015-10-30 Thread Raghuveer Chanda
There seems to be an error at the zinc server. How can I shut down the zinc server completely? *build/zinc-0.3.5.3/bin/zinc -shutdown* will shut it down, but it restarts again with the mvn/build command. *Error in Debug mode :* *[ERROR] Failed to execute goal

Re: Spark 1.5.1 Build Failure

2015-10-30 Thread Jia Zhan
Hi, Have you tried building it without Hadoop? $ build/mvn -DskipTests clean package Can you check whether build/mvn was started successfully, or whether it's using your own mvn? Let us know your JDK version as well. On Thu, Oct 29, 2015 at 11:34 PM, Raghuveer Chanda <

Re: Exception while reading from kafka stream

2015-10-30 Thread Ramkumar V
Spark version - Spark 1.4.1. My code snippet: String brokers = "ip:port,ip:port"; String topics = "x,y,z"; HashSet<String> topicsSet = new HashSet<String>(Arrays.asList(topics.split(","))); HashMap<String, String> kafkaParams = new HashMap<String, String>(); kafkaParams.put("metadata.broker.list", brokers);
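
For reference, a minimal Scala sketch of the same direct-stream setup; the broker list and topic names are placeholders and the batch interval is an assumption:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("KafkaDirectExample")
    val ssc = new StreamingContext(conf, Seconds(10))             // batch interval is an assumption

    val brokers = "ip:port,ip:port"                               // placeholder broker list
    val topics = Set("x", "y", "z")
    val kafkaParams = Map("metadata.broker.list" -> brokers)

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2).print()                                      // print a few records per batch
    ssc.start()
    ssc.awaitTermination()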

Re: Exception while reading from kafka stream

2015-10-30 Thread Ramkumar V
I am able to read and print a few lines. After that I'm getting this exception. Any idea why? *Thanks*, On Thu, Oct 29, 2015 at 6:14 PM, Ramkumar V wrote: > Hi, > > I'm trying to read from a Kafka stream and print it

Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
Do you have any special settings? From your code, I don't think it should incur an NPE at that place. On Fri, Oct 30, 2015 at 4:32 PM, Ramkumar V wrote: > Spark version - Spark 1.4.1 > > My code snippet: > > String brokers = "ip:port,ip:port"; > String topics = "x,y,z"; >

Re: Save data to different S3

2015-10-30 Thread William Li
I see. Thanks! From: Steve Loughran Date: Friday, October 30, 2015 at 12:03 PM To: William Li Cc: "Zhang, Jingyu", user

Extending Spark ML LogisticRegression Object

2015-10-30 Thread njoshi
Hi, I am extending the Spark ML package locally to include a specialized model I need to try. In particular, I am trying to extend the LogisticRegression model with one that takes a custom object Weights as its weights, and I am getting the following compilation error: could not find implicit

RE: Spark tuning increase number of active tasks

2015-10-30 Thread YI, XIAOCHUAN
Hi, our team has a 40-node Hortonworks Hadoop cluster 2.2.4.2-2 (36 data nodes) with Apache Spark 1.2 and 1.4 installed. Each node has 64G RAM and 8 cores. We are only able to use <= 72 executors with executor-cores=2, so we only get 144 active tasks when running PySpark programs.

RE: key not found: sportingpulse.com in Spark SQL 1.5.0

2015-10-30 Thread Silvio Fiorito
It's something due to the columnar compression. I've seen similar intermittent issues when caching DataFrames. "sportingpulse.com" is a value in one of the columns of the DF. From: Ted Yu Sent: 10/30/2015 6:33 PM To: Zhang,

Re: foreachPartition

2015-10-30 Thread Mark Hamstra
The closure is sent to and executed on an Executor, so you need to be looking at the stdout of the Executors, not of the Driver. On Fri, Oct 30, 2015 at 4:42 PM, Alex Nastetsky < alex.nastet...@vervemobile.com> wrote: > I'm just trying to do some operation inside foreachPartition, but I can't >
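
A minimal sketch of the difference; the second part is one assumed way to see something on the driver console, by collecting results back first:

    val a = sc.parallelize(List(1, 2, 3))

    // println here runs inside the executors; the output lands in each
    // executor's stdout (visible through the Spark UI / executor logs).
    a.foreachPartition(p => p.foreach(x => println(s"saw $x")))

    // To see output on the driver console, bring the results back first:
    val sizes = a.mapPartitions(p => Iterator(p.size)).collect()
    sizes.foreach(n => println(s"partition held $n element(s)"))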

foreachPartition

2015-10-30 Thread Alex Nastetsky
I'm just trying to do some operation inside foreachPartition, but I can't even get a simple println to work. Nothing gets printed. scala> val a = sc.parallelize(List(1,2,3)) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:21 scala> a.foreachPartition(p =>

key not found: sportingpulse.com in Spark SQL 1.5.0

2015-10-30 Thread Zhang, Jingyu
There is no problem in Spark SQL 1.5.1, but the error "key not found: sportingpulse.com" shows up when I use 1.5.0. I have to use 1.5.0 because that is the version AWS EMR supports. Can anyone tell me why Spark uses "sportingpulse.com" and how to fix it? Thanks. Caused by:

Re: Stack overflow error caused by long lineage RDD created after many recursions

2015-10-30 Thread Tathagata Das
You have to run some action after rdd.checkpoint() for the checkpointing to actually occur. Have you done that? On Fri, Oct 30, 2015 at 3:10 PM, Panos Str wrote: > Hi all! > > Here's a part of a Scala recursion that produces a stack overflow after > many > recursions. I've
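
A minimal sketch of that point (the checkpoint directory is a placeholder): checkpoint() only marks the RDD, and the lineage is truncated once an action materializes it.

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // placeholder path

    val rdd = sc.parallelize(1 to 1000000).map(x => (x % 10, x))
    rdd.checkpoint()    // only marks the RDD for checkpointing
    rdd.count()         // an action forces the checkpoint to be written
    // From here on, rdd's lineage is cut at the checkpoint, which is what
    // keeps a deeply recursive computation from overflowing the stack.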

Re: key not found: sportingpulse.com in Spark SQL 1.5.0

2015-10-30 Thread Zhang, Jingyu
Thanks Silvio and Ted. Can you please let me know how to fix this intermittent issue? Should I wait for EMR to support Spark 1.5.1, or change my code from DataFrames to plain Spark map-reduce? Regards, Jingyu On 31 October 2015 at 09:40, Silvio Fiorito

Re: foreachPartition

2015-10-30 Thread Alex Nastetsky
Ahh, makes sense. Knew it was going to be something simple. Thanks. On Fri, Oct 30, 2015 at 7:45 PM, Mark Hamstra wrote: > The closure is sent to and executed on an Executor, so you need to be looking > at the stdout of the Executors, not of the Driver. > > On Fri, Oct 30,

Using model saved by MLlib with out creating spark context

2015-10-30 Thread vijuks
I want to load a model saved by a spark machine learning job, in a web application. model.save(jsc.sc(), "myModelPath"); LogisticRegressionModel model = LogisticRegressionModel.load( jsc.sc(),

Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
I just did a local test with your code and everything seems fine. The only difference is that I use the master branch, but I don't think it changes a lot in this part. Did you meet any other exceptions or errors besides this one? Probably this is due to other exceptions that make the system

Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
I don't think Spark Streaming supports multiple streaming contexts in one JVM, so you cannot use it that way. Instead you could run multiple streaming applications, since you're using YARN. On Friday, 30 October 2015, Ramkumar V wrote: > I found the NPE is mainly because I'm using the same

Re: Pivot Data in Spark and Scala

2015-10-30 Thread Adrian Tanase
It's actually a bit tougher, as you'll first need all the years. Also, I'm not sure how you would represent your "columns" given they are dynamic based on the input data. Depending on your downstream processing, I'd probably try to emulate it with a hash map with years as keys instead of the columns.
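
A minimal sketch of that map-per-key idea, using the sample rows from the original question (plain RDD API; the "columns" stay dynamic because each client just carries a Map of year -> value):

    val rows = sc.parallelize(Seq(
      ("A", 2015, 4), ("A", 2014, 12), ("A", 2013, 1),
      ("B", 2015, 24), ("B", 2013, 4)))

    // One Map(year -> value) per client instead of a fixed set of columns
    val pivoted = rows
      .map { case (client, year, value) => (client, Map(year -> value)) }
      .reduceByKey(_ ++ _)

    pivoted.collect().foreach(println)
    // e.g. (A,Map(2015 -> 4, 2014 -> 12, 2013 -> 1))
    //      (B,Map(2015 -> 24, 2013 -> 4))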

Re: Exception while reading from kafka stream

2015-10-30 Thread Ramkumar V
No, this is the only exception, and I'm getting it multiple times in my log. Also, I was reading some other topics earlier but never faced this NPE. *Thanks*, On Fri, Oct 30, 2015 at 2:50 PM, Saisai Shao wrote: > I just did a local

Re: Issue of Hive parquet partitioned table schema mismatch

2015-10-30 Thread Jörn Franke
What storage format? > On 30 Oct 2015, at 12:05, Rex Xiong wrote: > > Hi folks, > > I have a Hive external table with partitions. > Every day, an app generates a new partition day=yyyy-MM-dd stored as > Parquet and runs an add-partition Hive command. > In some cases, we

Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
From the code, I think this field "rememberDuration" shouldn't be null; it is verified at the start, unless some place changes its value at runtime and makes it null, but I cannot imagine how that happened. Maybe you could add some logs around the place where the exception happens if you

Best practises

2015-10-30 Thread Deepak Sharma
Hi, I am looking for any blog/doc on developer best practices for using Spark. I have already looked at the tuning guide on spark.apache.org. Please do let me know if anyone is aware of any such resource. Thanks Deepak

Re: Exception while reading from kafka stream

2015-10-30 Thread Ramkumar V
I found the NPE is mainly because I'm using the same JavaStreamingContext for some other Kafka stream. If I change the name, it runs successfully. How do I run multiple JavaStreamingContexts in one program? I'm getting the following exception if I run multiple JavaStreamingContexts in a single file.

Re: How do I parallize Spark Jobs at Executor Level.

2015-10-30 Thread Deng Ching-Mallete
Hi, You seem to be creating a new RDD for each element in your files RDD. What I would suggest is to load and process only one sequence file in your Spark job, then just execute multiple spark jobs to process each sequence file. With regard to your question of where to view the logs inside the

Re: How do I parallize Spark Jobs at Executor Level.

2015-10-30 Thread Deng Ching-Mallete
Yes, it's also possible. Just pass in the sequence files you want to process as a comma-separated list in sc.sequenceFile(). -Deng On Fri, Oct 30, 2015 at 5:46 PM, Vinoth Sankar wrote: > Hi Deng. > > Thanks for the response. > > Is it possible to load sequence files
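
A minimal sketch of that, with placeholder paths; all the files land in one RDD and its partitions are processed in parallel:

    // Comma-separated list of sequence file paths (placeholders)
    val paths = "hdfs:///data/seq/part1,hdfs:///data/seq/part2,hdfs:///data/seq/part3"

    val records = sc.sequenceFile[String, String](paths)
    println(records.count())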

Re: key not found: sportingpulse.com in Spark SQL 1.5.0

2015-10-30 Thread Silvio Fiorito
I don’t believe I have it on 1.5.1. Are you able to test the data locally to confirm, or is it too large? From: "Zhang, Jingyu" Date: Friday, October 30, 2015 at 7:31 PM To: Silvio Fiorito

CompositeInputFormat in Spark

2015-10-30 Thread Alex Nastetsky
Does Spark have an implementation similar to CompositeInputFormat in MapReduce? CompositeInputFormat joins multiple datasets prior to the mapper, provided they are partitioned the same way with the same number of partitions, using the "part" number in the file name in each dataset to figure out which file

Performance issues in SSSP using GraphX

2015-10-30 Thread Khaled Ammar
Hi all, I am seeing interesting behavior from GraphX while running SSSP. I use stand-alone mode with 16+1 machines, each with 30GB memory and 4 cores. The dataset is 63GB. However, the input for some stages is huge, about 16 TB! The computation takes a very long time, so I stopped it. For your

Re: Exception while reading from kafka stream

2015-10-30 Thread Cody Koeninger
Just put them all in one stream and switch processing based on the topic On Fri, Oct 30, 2015 at 6:29 AM, Ramkumar V wrote: > i want to join all those logs in some manner. That's what i'm trying to do. > > *Thanks*, > > > > On
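
A minimal sketch of that single-stream approach with the direct API; the brokers and topic names are placeholders (reuse your own), and the per-topic handling is just a placeholder println:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    val ssc = new StreamingContext(new SparkConf().setAppName("MultiTopic"), Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "ip:port,ip:port")   // placeholder brokers

    // One direct stream over all the topics (extend the set with the other log types)
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("x", "y", "z"))

    stream.foreachRDD { rdd =>
      // Each RDD partition maps 1:1 to a Kafka (topic, partition) offset range
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.mapPartitionsWithIndex { (i, iter) =>
        iter.map { case (_, msg) => (ranges(i).topic, msg) }
      }.foreach { case (topic, msg) =>
        topic match {
          case "x" => println(s"x log: $msg")      // replace with x-specific handling
          case "y" => println(s"y log: $msg")      // replace with y-specific handling
          case _   => println(s"other log: $msg")
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()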

sparkR 1.5.1 batch yarn-client mode failing on daemon.R not found

2015-10-30 Thread Tom Stewart
I have the following script in a file named test.R: library(SparkR) sc <- sparkR.init(master="yarn-client") sqlContext <- sparkRSQL.init(sc) df <- createDataFrame(sqlContext, faithful) showDF(df) sparkR.stop() q(save="no") If I submit this with "sparkR test.R" or "R  CMD BATCH test.R" or

Spark Streaming (1.5.0) flaky when recovering from checkpoint

2015-10-30 Thread David P. Kleinschmidt
I have a Spark Streaming job that runs great the first time around (Elastic MapReduce 4.1.0), but when recovering from a checkpoint in S3, the job runs but Spark itself seems to be jacked-up in lots of little ways: - Executors, which are normally stable for days, are terminated within a

Re: issue with spark.driver.maxResultSize parameter in spark 1.3

2015-10-30 Thread karthik kadiyam
Hi Shahid, I played around with the Spark driver memory too. In the conf file it was set to "--driver-memory 20G" first. When I changed spark.driver.maxResultSize from the default to 2g, I changed the driver memory to 30G and tried that too. It gave the same error saying "bigger than

RE: Pivot Data in Spark and Scala

2015-10-30 Thread Andrianasolo Fanilo
Hey, The question is tricky, here is a possible answer by defining years as keys for a hashmap per client and merging those : import scalaz._ import Scalaz._ val sc = new SparkContext("local[*]", "sandbox") // Create RDD of your objects val rdd = sc.parallelize(Seq( ("A", 2015, 4), ("A",

heap memory

2015-10-30 Thread Younes Naguib
Hi all, I'm running a Spark shell: bin/spark-shell --executor-memory 32G --driver-memory 8G I keep getting: 15/10/30 13:41:59 WARN MemoryManager: Total allocation exceeds 95.00% (2,147,483,647 bytes) of heap memory Any help? Thanks, Younes Naguib Triton Digital | 1440

Re: Maintaining overall cumulative data in Spark Streaming

2015-10-30 Thread Silvio Fiorito
In the update function you can return None for a key and it will remove it. If you’re restarting your app you can delete your checkpoint directory to start from scratch, rather than continuing from the previous state. From: Sandeep Giri
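
A minimal sketch of such an update function; the reset condition is a hypothetical predicate, and the DStream it is applied to is assumed to be a DStream[(String, Long)]:

    // A hypothetical reset condition; replace with whatever should trigger the reset
    def shouldReset(total: Long): Boolean = false

    // updateStateByKey keeps a running total per key.
    // Returning None drops the key from the state, i.e. "resets" it.
    def update(newValues: Seq[Long], state: Option[Long]): Option[Long] = {
      val total = state.getOrElse(0L) + newValues.sum
      if (shouldReset(total)) None else Some(total)
    }

    // val running = pairs.updateStateByKey(update _)   // pairs: DStream[(String, Long)]
    // (updateStateByKey also requires ssc.checkpoint(...) to be set)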

Re: Saving RDDs in Tachyon

2015-10-30 Thread Akhil Das
I guess you can do a .saveAsObjectFiles and read it back as sc.objectFile Thanks Best Regards On Fri, Oct 23, 2015 at 7:57 AM, mark wrote: > I have Avro records stored in Parquet files in HDFS. I want to read these > out as an RDD and save that RDD in Tachyon for
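
For an RDD the methods are saveAsObjectFile / objectFile; a minimal sketch, assuming a Tachyon master at a placeholder address (object files use Java serialization, so the element type must be serializable and must match on the read side):

    val rdd = sc.parallelize(1 to 100).map(i => (i, i * i))

    // Write the RDD out as an object file on Tachyon (address and path are placeholders)
    rdd.saveAsObjectFile("tachyon://tachyon-master:19998/tmp/my-rdd")

    // Read it back later, with the same element type
    val restored = sc.objectFile[(Int, Int)]("tachyon://tachyon-master:19998/tmp/my-rdd")
    println(restored.count())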

Spark 1.5.1 Dynamic Resource Allocation

2015-10-30 Thread Tom Stewart
I am running the following command on a Hadoop cluster to launch Spark shell with DRA: spark-shell  --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=4 --conf spark.dynamicAllocation.maxExecutors=12 --conf

Re: Whether Spark will use disk when the memory is not enough on MEMORY_ONLY Storage Level

2015-10-30 Thread Akhil Das
You can set it to MEMORY_AND_DISK, in this case data will fall back to disk when it exceeds the memory. Thanks Best Regards On Fri, Oct 23, 2015 at 9:52 AM, JoneZhang wrote: > 1.Whether Spark will use disk when the memory is not enough on MEMORY_ONLY > Storage Level? >
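
A minimal sketch (the input path is a placeholder):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///data/big-input")       // placeholder path
    rdd.persist(StorageLevel.MEMORY_AND_DISK)             // partitions that don't fit in memory spill to disk
    println(rdd.count())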

Re: Maintaining overall cumulative data in Spark Streaming

2015-10-30 Thread Sandeep Giri
How do we reset the aggregated statistics to null? Regards, Sandeep Giri, +1 347 781 4573 (US) +91-953-899-8962 (IN) www.KnowBigData.com. Phone: +1-253-397-1945 (Office)

Re: [Spark] java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-30 Thread Yifan LI
Thanks Deng. Yes, I agree that there is a partition larger than 2GB, which caused this exception. But in my case it seems unhelpful to fix this problem by simply increasing the partitioning in the sortBy operation. I think the partitioning in sortBy is not balanced, e.g. in my

Re: Whether Spark will use disk when the memory is not enough on MEMORY_ONLY Storage Level

2015-10-30 Thread Ted Yu
Jone: For #3, consider asking on the vendor's mailing list. On Fri, Oct 30, 2015 at 7:11 AM, Akhil Das wrote: > You can set it to MEMORY_AND_DISK, in this case data will fall back to > disk when it exceeds the memory. > > Thanks > Best Regards > > On Fri, Oct 23, 2015 at

Caching causes later actions to get stuck

2015-10-30 Thread Sampo Niskanen
Hi, I'm facing a problem where Spark is able to perform an action on a cached RDD correctly the first time it is run, but running it immediately afterwards (or an action depending on that RDD) causes it to get stuck. I'm using a MongoDB connector for fetching all documents from a collection to

SparkR job with >200 tasks hangs when calling from web server

2015-10-30 Thread rporcio
Hi, I have a web server which can execute R codes using SparkR. The R session is created with the Rscript init.R command where the /init.R/ file contains a sparkR initialization section: /library(SparkR, lib.loc = paste("/opt/Spark/spark-1.5.1-bin-hadoop2.6", "R", "lib", sep = "/")) sc <<-

Re: Save data to different S3

2015-10-30 Thread Steve Loughran
On 30 Oct 2015, at 18:05, William Li wrote: Thanks for your response. My secret has a slash (/) in it so it didn’t work… That's a recurrent problem with the Hadoop/Java S3 clients. Keep trying to regenerate a secret until you get one that works

how to merge two dataframes

2015-10-30 Thread Yana Kadiyska
Hi folks, I have a need to "append" two dataframes -- I was hoping to use unionAll, but it seems that this operation treats the underlying dataframes as a sequence of columns rather than a map. In particular, my problem is that the columns in the two DFs are not in the same order -- notice that my

Re: Pivot Data in Spark and Scala

2015-10-30 Thread Ali Tajeldin EDU
You can take a look at the smvPivot function in the SMV library ( https://github.com/TresAmigosSD/SMV ). Look for the method "smvPivot" in SmvDFHelper ( http://tresamigossd.github.io/SMV/scaladocs/index.html#org.tresamigos.smv.SmvDFHelper). You can also perform the pivot on a

Re: Pivot Data in Spark and Scala

2015-10-30 Thread Ruslan Dautkhanov
https://issues.apache.org/jira/browse/SPARK-8992 Should be in 1.6? -- Ruslan Dautkhanov On Thu, Oct 29, 2015 at 5:29 AM, Ascot Moss wrote: > Hi, > > I have data as follows: > > A, 2015, 4 > A, 2014, 12 > A, 2013, 1 > B, 2015, 24 > B, 2013 4 > > > I need to convert the

Re: how to merge two dataframes

2015-10-30 Thread Ted Yu
How about the following ? scala> df.registerTempTable("df") scala> df1.registerTempTable("df1") scala> sql("select customer_id, uri, browser, epoch from df union select customer_id, uri, browser, epoch from df1").show() +---+-+---+-+ |customer_id|
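
An equivalent sketch using the DataFrame API directly, assuming both frames live in the same SQLContext (which is exactly what breaks in the Cassandra case discussed further down): reorder one frame's columns to match the other, then unionAll.

    import org.apache.spark.sql.functions.col

    // Select df1's columns in df's column order, then append the rows
    val merged = df.unionAll(df1.select(df.columns.map(col): _*))
    merged.show()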

RE: Error building Spark on Windows with sbt

2015-10-30 Thread Judy Nash
I have not had any success building using sbt/sbt on Windows. However, I have been able to build the binary by using the Maven command directly. From: Richard Eggert [mailto:richard.egg...@gmail.com] Sent: Sunday, October 25, 2015 12:51 PM To: Ted Yu Cc: User

Stack overflow error caused by long lineage RDD created after many recursions

2015-10-30 Thread Panos Str
Hi all! Here's a part of a Scala recursion that produces a stack overflow after many recursions. I've tried many things but haven't managed to solve it. val eRDD: RDD[(Int,Int)] = ... val oldRDD: RDD[(Int,Int)] = ... val result = *Algorithm*(eRDD, oldRDD) *Algorithm*(eRDD: RDD[(Int,Int)] ,

Re: key not found: sportingpulse.com in Spark SQL 1.5.0

2015-10-30 Thread Ted Yu
I searched for sportingpulse in *.scala and *.java files under 1.5 branch. There was no hit. mvn dependency doesn't show sportingpulse either. Is it possible this is specific to EMR ? Cheers On Fri, Oct 30, 2015 at 2:57 PM, Zhang, Jingyu wrote: > There is not a

Re: Exception while reading from kafka stream

2015-10-30 Thread Ramkumar V
No, I don't have any special settings. Even if I keep only the reading line in my code, it's throwing the NPE. *Thanks*, On Fri, Oct 30, 2015 at 2:14 PM, Saisai Shao wrote: > Do you have any special settings? From your code, I don't think it

Re: How do I parallize Spark Jobs at Executor Level.

2015-10-30 Thread Vinoth Sankar
Hi Deng. Thanks for the response. Is it possible to load sequence files in parallel and process each of them in parallel...? Regards Vinoth Sankar On Fri, Oct 30, 2015 at 2:56 PM Deng Ching-Mallete wrote: > Hi, > > You seem to be creating a new RDD for each element in your

Issue of Hive parquet partitioned table schema mismatch

2015-10-30 Thread Rex Xiong
Hi folks, I have a Hive external table with partitions. Every day, an app generates a new partition day=yyyy-MM-dd stored as Parquet and runs an add-partition Hive command. In some cases, we add an additional column to new partitions and update the Hive table schema; then a query across new and old

Re: Best practises

2015-10-30 Thread huangzheng
I have the same question. Can anyone help us? -- From: "Deepak Sharma"; Date: Friday, 30 October 2015, 7:23; To: "user"; Subject: Best practises. Hi, I am looking for any blog / doc on the

Re: Exception while reading from kafka stream

2015-10-30 Thread Ramkumar V
In general, I need to consume five different types of logs from Kafka in Spark. I have a different set of topics for each log. How do I start five different streams in Spark? *Thanks*, On Fri, Oct 30, 2015 at 4:40 PM, Ramkumar V

Re: Issue of Hive parquet partitioned table schema mismatch

2015-10-30 Thread Michael Armbrust
> > We have tried the schema merging feature, but it's too slow; there are hundreds > of partitions. > Which version of Spark?

Re: Exception while reading from kafka stream

2015-10-30 Thread Ramkumar V
I want to join all those logs in some manner. That's what I'm trying to do. *Thanks*, On Fri, Oct 30, 2015 at 4:57 PM, Saisai Shao wrote: > I don't think Spark Streaming supports multiple streaming contexts in one > JVM, so you

RE: Pulling data from a secured SQL database

2015-10-30 Thread Young, Matthew T
> Can the driver pull data and then distribute execution? Yes, as long as your dataset will fit in the driver's memory. Execute arbitrary code to read the data on the driver as you normally would if you were writing a single-node application. Once you have the data in a collection on the
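
A minimal Scala sketch of that pattern (the JDBC URL, query, and row type are hypothetical; the driver machine is assumed to be one of the hosts allowed through the integrated-security firewall):

    import java.sql.DriverManager
    import scala.collection.mutable.ArrayBuffer

    // Hypothetical SQL Server connection string using integrated security
    val url = "jdbc:sqlserver://dbhost:1433;databaseName=mydb;integratedSecurity=true"

    // Read the rows on the driver, exactly as a single-node application would
    val rows = ArrayBuffer[(Int, String)]()
    val conn = DriverManager.getConnection(url)
    try {
      val rs = conn.createStatement().executeQuery("SELECT id, name FROM some_table")
      while (rs.next()) rows += ((rs.getInt("id"), rs.getString("name")))
    } finally {
      conn.close()
    }

    // Distribute the collected rows so the rest of the work runs on the cluster
    val rdd = sc.parallelize(rows)
    println(rdd.count())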

Pulling data from a secured SQL database

2015-10-30 Thread Thomas Ginter
I am working in an environment where data is stored in MS SQL Server. It has been secured so that only a specific set of machines can access the database through an integrated security Microsoft JDBC connection. We also have a couple of beefy linux machines we can use to host a Spark cluster

Re: Save data to different S3

2015-10-30 Thread William Li
Thanks for your response. My secret has a slash (/) in it so it didn't work... From: "Zhang, Jingyu" Date: Thursday, October 29, 2015 at 5:16 PM To: William Li Cc: user

Re: how to merge two dataframes

2015-10-30 Thread Yana Kadiyska
Not a bad idea I suspect but doesn't help me. I dumbed down the repro to ask for help. In reality one of my dataframes is a cassandra DF. So cassDF.registerTempTable("df1") registers the temp table in a different SQL Context (new CassandraSQLContext(sc)). scala> sql("select customer_id, uri,

Re: how to merge two dataframes

2015-10-30 Thread Ted Yu
I see - you were trying to union a non-Cassandra DF with Cassandra DF :-( On Fri, Oct 30, 2015 at 12:57 PM, Yana Kadiyska wrote: > Not a bad idea I suspect but doesn't help me. I dumbed down the repro to > ask for help. In reality one of my dataframes is a cassandra DF.

Re: how to merge two dataframes

2015-10-30 Thread Silvio Fiorito
Are you able to upgrade to Spark 1.5.1 and the Cassandra connector to the latest version? It no longer requires a separate CassandraSQLContext. From: Yana Kadiyska Reply-To: "yana.kadiy...@gmail.com"

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-30 Thread Bryan Jeffrey
Deenar, This worked perfectly - I moved to SQL Server and things are working well. Regards, Bryan Jeffrey On Thu, Oct 29, 2015 at 8:14 AM, Deenar Toraskar wrote: > Hi Bryan > > For your use case you don't need to have multiple metastores. The default > metastore