Re: local class incompatible: stream classdesc serialVersionUID

2016-01-29 Thread Jason Plurad
I agree with you, Ted: if RDD had a serialVersionUID, this might not be an issue. So that could be a JIRA to submit to help avoid version mismatches in future Spark versions, but it doesn't help my current situation between 1.5.1 and 1.5.2. Any other ideas? Thanks. On Thu, Jan 28, 2016 at 5:06
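For context, a minimal sketch of the mechanism under discussion: pinning a serialVersionUID keeps a recompiled class wire-compatible across releases. The class below is purely illustrative (it is not Spark's RDD itself):

    // Hypothetical class; without the annotation the JVM derives the UID
    // from the class structure, so any recompile can change it and break
    // deserialization between mismatched versions.
    @SerialVersionUID(1L)
    class MyRecord(val id: Long, val payload: String) extends Serializable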

Re: ZlibFactory warning

2016-01-29 Thread Ted Yu
Did the stack trace look like the one from https://issues.apache.org/jira/browse/HADOOP-12638? Cheers > On Jan 27, 2016, at 1:29 AM, Eli Super wrote: > > > Hi > > I'm running Spark locally on a Win 2012 R2 server > > No Hadoop installed > > I'm getting the following error:

Re: building spark 1.6.0 fails

2016-01-29 Thread Sean Owen
You're somehow building with Java 6. At least this is what the error means. On Fri, Jan 29, 2016, 05:25 Carlile, Ken wrote: > I am attempting to build Spark 1.6.0 from source on EL 6.3, using Oracle > jdk 1.8.0.45, Python 2.7.6, and Scala 2.10.3. When I try to issue >

Re: Spark Algorithms as WEB Application

2016-01-29 Thread Ted Yu
Have you looked at: http://wiki.apache.org/tomcat/OutOfMemory Cheers > On Jan 29, 2016, at 2:44 AM, rahulganesh wrote: > > Hi, > I am currently working on a web application which will call the spark mllib > algorithms using JERSEY ( REST API ). The problem that i am

Re: Number of batches in the Streaming Statistics visualization screen

2016-01-29 Thread Terry Hoo
Yes, the data is stored in driver memory. Mehdi Ben Haj Abbes wrote on Friday, January 29, 2016 at 18:13: > Thanks Terry for the quick answer. > > I did not try it. Let's say I increase the value to 2; what > side effects should I expect? In fact in the explanation of the property

Spark Algorithms as WEB Application

2016-01-29 Thread rahulganesh
Hi, I am currently working on a web application which will call the Spark MLlib algorithms using JERSEY (REST API). The problem that I am facing is that I am frequently getting a PermGen space Java out-of-memory exception, and also I am not able to save decision tree models using

mapWithState / stateSnapshots() yielding empty rdds?

2016-01-29 Thread Sebastian Piu
Hi All, I'm playing with the new mapWithState functionality but I can't quite get it to work yet. I'm doing two print() calls on the stream: 1. after the mapWithState() call, the first batch shows results, but the next batches yield empty RDDs; 2. after stateSnapshots(), it always yields an empty RDD. Any pointers on

Re: streaming textFileStream problem - got only ONE line

2016-01-29 Thread patcharee
I moved them every interval to the monitored directory. Patcharee On 25. jan. 2016 22:30, Shixiong(Ryan) Zhu wrote: Did you move the file into "hdfs://helmhdfs/user/patcharee/cerdata/", or write into it directly? `textFileStream` requires that files must be written to the monitored directory
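A minimal sketch of the constraint Shixiong describes (path taken from the thread; batch interval and an existing SparkContext `sc` are assumed): files must appear in the monitored directory atomically, so write them elsewhere and move them in.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(60))   // batch interval assumed
    // textFileStream only reliably picks up files that are moved/renamed into
    // the directory; a file written in place can be read while still partial,
    // which can yield a single line.
    val lines = ssc.textFileStream("hdfs://helmhdfs/user/patcharee/cerdata/")
    lines.print()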

Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-29 Thread cc
Hey Jia Zou, I'm curious about this exception. The error log you showed indicates that the exception is related to unlockBlock; could you upload your full master.log and worker.log under the tachyon/logs directory? Best, Cheng On Friday, January 29, 2016 at 11:11:19 AM UTC+8, Calvin Jia wrote: > > Hi, > > Thanks for the

Re: Number of batches in the Streaming Statistics visualization screen

2016-01-29 Thread Mehdi Ben Haj Abbes
Thanks Terry for the quick answer. I did not try it. Let's say I increase the value to 2; what side effects should I expect? In fact, the explanation of the property says: "How many finished batches the Spark UI and status APIs remember before garbage collecting." So the data is stored in

Re: GraphX can show graph?

2016-01-29 Thread Balachandar R.A.
Thanks... Will look into that - Bala On 28 January 2016 at 15:36, Sahil Sareen wrote: > Try Neo4j for visualization; GraphX does a pretty good job at distributed > graph processing. > > On Thu, Jan 28, 2016 at 12:42 PM, Balachandar R.A. < > balachandar...@gmail.com> wrote:

Re: Spark GraphX + TitanDB + Cassandra?

2016-01-29 Thread Nick Pentreath
Hi Joe, A while ago I was running a Titan + HBase datastore to store graph data. I then used Spark (via TitanHBaseInputFormat; you could use the Cassandra version) to access an RDD[Vertex] that I performed analytics and machine learning on. That could form the basis of putting the data into a form

Re: Number of batches in the Streaming Statistics visualization screen

2016-01-29 Thread Terry Hoo
Hi Mehdi, Did you try a larger value of "spark.streaming.ui.retainedBatches" (default is 1000)? Regards, - Terry On Fri, Jan 29, 2016 at 5:45 PM, Mehdi Ben Haj Abbes wrote: > Hi folks, > > I have a streaming job running for more than 24 hours. It seems that there > is a
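A minimal sketch of setting that property (app name, batch interval, and value are assumed). Note that each retained batch's metadata lives in driver memory, which is the side effect asked about in this thread:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("streaming-stats")                       // hypothetical
      .set("spark.streaming.ui.retainedBatches", "5000")   // default is 1000
    val ssc = new StreamingContext(conf, Seconds(10))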

Re: mapWithState / stateSnapshots() yielding empty rdds?

2016-01-29 Thread Sebastian Piu
Just saw I'm not calling state.update() in my trackState function. I guess that is the issue! On Fri, Jan 29, 2016 at 9:36 AM, Sebastian Piu wrote: > Hi All, > > I'm playing with the new mapWithState functionality but I can't get it > quite to work yet. > > I'm
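For reference, a minimal Spark 1.6 sketch of a mapping function that does call state.update(); without that call nothing is persisted, so stateSnapshots() stays empty. Key and value types are assumed:

    import org.apache.spark.streaming.{State, StateSpec}

    // Running sum per key, for illustration only.
    val spec = StateSpec.function((key: String, value: Option[Int], state: State[Int]) => {
      val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)   // persists the new state for this key
      (key, sum)          // emitted on the mapped DStream
    })
    // pairStream: DStream[(String, Int)], assumed in scope
    val mapped = pairStream.mapWithState(spec)
    mapped.stateSnapshots().print()   // now yields non-empty RDDs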

Number of batches in the Streaming Statistics visualization screen

2016-01-29 Thread Mehdi Ben Haj Abbes
Hi folks, I have a streaming job running for more than 24 hours. It seems that there is a limit on the number of batches displayed in the Streaming Statistics visualization screen. For example, if I launch a job on Friday, I will not be able to see the statistics for what happened during

Re: Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

2016-01-29 Thread Daniel Darabos
Hi Andrew, If you still see this with Spark 1.6.0, it would be very helpful if you could file a bug about it at https://issues.apache.org/jira/browse/SPARK with as much detail as you can. This issue could be a nasty source of silent data corruption in a case where some intermediate data loses 8

Re: local class incompatible: stream classdesc serialVersionUID

2016-01-29 Thread Ted Yu
I logged SPARK-13084. For the moment, please consider running 1.5.2 on all the nodes. On Fri, Jan 29, 2016 at 5:29 AM, Jason Plurad wrote: > I agree with you, Ted, if RDD had a serial version UID this might not be > an issue. So that could be a JIRA to submit to help

Spark Streaming from existing RDD

2016-01-29 Thread Sateesh Karuturi
Can anyone please help me out with how to create a DStream from an existing RDD? My code is: JavaSparkContext ctx = new JavaSparkContext(conf); JavaRDD rddd = ctx.parallelize(arraylist); Now I need to use this *rddd* as input to *JavaStreamingContext*.

Pyspark filter not empty

2016-01-29 Thread patcharee
Hi, In pyspark, how do I filter rows where a dataframe column is not empty? I tried: dfNotEmpty = df.filter(df['msg']!='') It did not work. Thanks, Patcharee

Re: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

2016-01-29 Thread Iulian Dragoș
On Fri, Jan 29, 2016 at 5:22 PM, Iulian Dragoș wrote: > I found the issue in the 2.11 version of the REPL, PR will follow shortly. > https://github.com/apache/spark/pull/10984 > > The 2.10 version of Spark doesn't have this issue, so you could use that > in the

Getting Exceptions/WARN during random runs for same dataset

2016-01-29 Thread Khusro Siddiqui
Hi Everyone, Environment used: DataStax Enterprise 4.8.3, which is bundled with Spark 1.4.1 and Scala 2.10.5. I am using DataFrames to query Cassandra, do processing, and store the result back into Cassandra. The job is being submitted using spark-submit on a cluster of 3 nodes. While doing so I

Re: Hive on Spark knobs

2016-01-29 Thread Ruslan Dautkhanov
Yep, I tried that. It seems you're right. I got an error that the execution engine has to be set to mr (hive.execution.engine = mr). I did not keep the exact error message/stack. It's probably disabled explicitly. -- Ruslan Dautkhanov On Thu, Jan 28, 2016 at 7:03 AM, Todd wrote: > Did

Re: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

2016-01-29 Thread Iulian Dragoș
I found the issue in the 2.11 version of the REPL; a PR will follow shortly. The 2.10 version of Spark doesn't have this issue, so you could use that in the meantime. iulian On Wed, Jan 27, 2016 at 3:17 PM, wrote: > So far, still cannot find a way of running a

Re: Spark Caching Kafka Metadata

2016-01-29 Thread Cody Koeninger
The Kafka direct stream doesn't do any explicit caching. I haven't looked through the underlying simple consumer code in the Kafka project in detail, but I doubt it does either. Honestly, I'd recommend not using auto-created topics (it makes it too easy to pollute your topics if someone

Re: Spark Streaming from existing RDD

2016-01-29 Thread Shixiong(Ryan) Zhu
Do you just want to write some unit tests? If so, you can use "queueStream" to create a DStream from a queue of RDDs. However, because it doesn't support metadata checkpointing, it's better to only use it in unit tests. On Fri, Jan 29, 2016 at 7:35 AM, Sateesh Karuturi <
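A minimal sketch of that approach (names assumed, with an existing SparkContext `sc`), suitable for tests only since queueStream does not support checkpointing:

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))
    val queue = new mutable.Queue[RDD[String]]()
    queue += sc.parallelize(Seq("a", "b", "c"))   // the pre-existing RDD
    // Each batch interval dequeues one RDD (oneAtATime defaults to true).
    val stream = ssc.queueStream(queue)
    stream.print()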

Re: Broadcast join on multiple dataframes

2016-01-29 Thread Srikanth
Michael, the output of DF.queryExecution is saved to https://www.dropbox.com/s/1vizuwpswza1e3x/plan.txt?dl=0 I don't see anything in this to suggest a switch in strategy. Hopefully you find this helpful. Srikanth On Thu, Jan 28, 2016 at 4:43 PM, Michael Armbrust wrote: >
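As an aside, a minimal sketch of forcing the broadcast strategy explicitly rather than relying on the planner's choice (DataFrame names are assumed):

    import org.apache.spark.sql.functions.broadcast

    // broadcast() hints the planner to ship smallDf to every executor
    // instead of shuffling both sides of the join.
    val joined = largeDf.join(broadcast(smallDf), "id")
    println(joined.queryExecution)   // inspect which strategy was chosen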

Re: Spark, Mesos, Docker and S3

2016-01-29 Thread Mao Geng
Sathish, the constraint you described is Marathon's, not Mesos's :) spark.mesos.constraints is applied to slave attributes like tachyon=true;us-east-1=false, as described in https://issues.apache.org/jira/browse/SPARK-6707. Cheers, -Mao On Fri, Jan 29, 2016 at 2:51 PM, Sathish Kumaran

Re: Reading lzo+index with spark-csv (Splittable reads)

2016-01-29 Thread syepes
Well, looking at the src it looks like it's not implemented: https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/util/TextFile.scala#L34-L36

Reading multiple avro files from a dir - Spark 1.5.1

2016-01-29 Thread Ajinkya Kale
Trying to load Avro from HDFS. I have around 1000 part Avro files in a dir. I am using this to read them: val df = sqlContext.read.format("com.databricks.spark.avro").load("path/to/avro/dir") df.select("QUERY").take(50).foreach(println) It works if I pass only 1 or 2 Avro files in the

Re: stopping spark stream app

2016-01-29 Thread agateaaa
Hi, We recently started working on using Spark Streaming to fetch and process data from Kafka (direct streaming, not receiver-based; Spark 1.5.2). We want to be able to stop the streaming application and tried implementing the approach suggested above, using a stopping thread and calling
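A minimal sketch of the stop-from-another-thread pattern being described (the stop trigger itself is application-specific and only hinted at here; `ssc` is the running StreamingContext, assumed in scope):

    val stopper = new Thread(new Runnable {
      override def run(): Unit = {
        waitForStopSignal()   // hypothetical: e.g. poll for a marker file
        // Let in-flight batches finish, then tear down the SparkContext too.
        ssc.stop(stopSparkContext = true, stopGracefully = true)
      }
    })
    stopper.setDaemon(true)
    stopper.start()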

Re: How to use DStream repartition()?

2016-01-29 Thread Andy Davidson
The following code seems to do what I want. I repartition the RDDs, not the DStreams. I wonder if this has to do with the way windows work? private static void saveTweetsCSV(JavaSparkContext jsc, JavaDStream tidy, String outputURI) { tidy.foreachRDD(new VoidFunction2,
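A rough Scala equivalent of that approach (the row limit, stream, and output path are assumptions): repartition each batch's RDD so no output file exceeds a target row count.

    // tidy: DStream[String]; maxRowsPerFile and outputURI are assumed.
    val maxRowsPerFile = 100000L
    tidy.foreachRDD { rdd =>
      val rows = rdd.count()
      val numParts = math.max(1L, (rows + maxRowsPerFile - 1) / maxRowsPerFile).toInt
      // One output file is written per partition.
      rdd.repartition(numParts).saveAsTextFile(s"$outputURI/batch-${System.currentTimeMillis}")
    }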

Re: Spark streaming and ThreadLocal

2016-01-29 Thread N B
Fixed a typo in the code to avoid any confusion. Please comment on the code below... dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() { public SomeClass initialValue() { return new SomeClass(); } }; somefunc(p, d.get()); d.remove(); return p; }; ); On Fri, Jan

How to use DStream repartition()?

2016-01-29 Thread Andy Davidson
My streaming app has a requirement that my output be saved in the smallest number of files possible, such that each file does not exceed a max number of rows. Based on my experience, it appears that each partition will be written to a separate output file. This was really easy to do in my batch

Garbage collections issue on MapPartitions

2016-01-29 Thread rcollich
Hi all, I currently have a mapPartitions job which is flatMapping each value in the iterator, and I'm running into an issue where there will be major GC costs on certain executions. Some executors will take 20 minutes, 15 of which are pure garbage collection, and I believe that a lot of it has to
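Without the full code it is hard to be sure, but one common source of this symptom is materializing each partition into an intermediate collection instead of staying lazy. A sketch of the distinction (expand: T => Seq[U] is assumed):

    // Lazy: records stream through the iterator one at a time, so
    // intermediate results become garbage-collectable early.
    rdd.mapPartitions(iter => iter.flatMap(expand))

    // Eager: buffers the whole partition's output before returning,
    // which drives up GC pressure on large partitions.
    rdd.mapPartitions(iter => iter.toList.flatMap(expand).iterator)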

GoogleAnalytics GAData

2016-01-29 Thread Andrés Ivaldi
Hello, I'm using the Google API to retrieve Google Analytics JSON. I'd like to use Spark to load the JSON, but toString truncates the value. I could save it to disk and then retrieve it, but I'm losing performance. Is there any other way? Regards -- Ing. Ivaldi Andres

Re: Spark streaming and ThreadLocal

2016-01-29 Thread N B
So this use of ThreadLocal will be inside the code of a function executing on the workers, i.e. within a call from one of the lambdas. Would it just look like this then: dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() { public SomeClass initialValue() { return new SomeClass(); }

Re: Spark streaming and ThreadLocal

2016-01-29 Thread N B
Well, won't the code in the lambda execute inside multiple threads in the worker, because it has to process many records? I would just want to have a single copy of SomeClass instantiated per thread rather than once per record being processed. That was what triggered this thought anyway. Thanks

Re: Spark, Mesos, Docker and S3

2016-01-29 Thread Sathish Kumaran Vairavelu
Hi, Quick question: how do I pass the constraint [["hostname", "CLUSTER", "specific.node.com"]] to Mesos? I was trying --conf spark.mesos.constraints=hostname:specific.node.com, but it didn't seem to work. Please help. Thanks, Sathish On Thu, Jan 28, 2016 at 6:52 PM Mao Geng

saveAsTextFile is not writing to local fs

2016-01-29 Thread Siva
Hi Everyone, We are using Spark 1.4.1 and we have a requirement of writing data to the local fs instead of HDFS. When trying to save an rdd to the local fs with saveAsTextFile, it just writes a _SUCCESS file in the folder with no part- files, and also no error or warning messages on the console. Is there any

Reading lzo+index with spark-csv (Splittable reads)

2016-01-29 Thread syepes
Hello, I have managed to speed up the read stage when loading CSV files using the classic "newAPIHadoopFile" method. The issue is that I would like to use the spark-csv package, and it seems that it's not taking the LZO index file / splittable reads into consideration. /# Using the classic method
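For reference, a sketch of the "classic" splittable read mentioned above, assuming the hadoop-lzo package (which provides LzoTextInputFormat) is on the classpath and the .index file sits next to the .lzo file:

    import com.hadoop.mapreduce.LzoTextInputFormat   // from hadoop-lzo, assumed
    import org.apache.hadoop.io.{LongWritable, Text}

    // With the index present, each LZO block becomes its own split,
    // so the read parallelizes instead of running as a single task.
    val lines = sc.newAPIHadoopFile(
        "hdfs:///data/input.csv.lzo",                // hypothetical path
        classOf[LzoTextInputFormat],
        classOf[LongWritable],
        classOf[Text])
      .map { case (_, text) => text.toString }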

[MLlib] What is the best way to forecast the next month page visit?

2016-01-29 Thread diplomatic Guru
Hello guys, I'm trying to understand how I could predict the next month's page views based on the previous access pattern. For example, I've collected statistics on page views, e.g. (Page, UniqueView): pageA, 1; pageB, 999; ... pageZ, 200. I aggregate the statistics monthly.

Re: Spark streaming and ThreadLocal

2016-01-29 Thread Shixiong(Ryan) Zhu
It looks weird. Why don't you just pass "new SomeClass()" to "somefunc"? You don't need to use ThreadLocal if there are no multiple threads in your code. On Fri, Jan 29, 2016 at 4:39 PM, N B wrote: > Fixed a typo in the code to avoid any confusion. Please comment on the

Re: Spark streaming and ThreadLocal

2016-01-29 Thread Shixiong(Ryan) Zhu
I see. Then you should use `mapPartitions` rather than ThreadLocal. E.g., dstream.mapPartitions { iter => val d = new SomeClass(); iter.map { p => somefunc(p, d) } } On Fri, Jan 29, 2016 at 5:29 PM, N B wrote: > Well, won't the

RE: saveAsTextFile is not writing to local fs

2016-01-29 Thread Mohammed Guller
Is it a multi-node cluster or are you running Spark on a single machine? You can change Spark's logging level to INFO or DEBUG to see what is going on. Mohammed Author: Big Data Analytics with Spark From: Siva

Re: saveAsTextFile is not writing to local fs

2016-01-29 Thread Siva
Hi Mohammed, Thanks for your quick response. I'm submitting the Spark job to YARN in "yarn-client" mode on a 6-node cluster. I ran the job with DEBUG mode turned on. I see the below exception, but it occurred after the saveAsTextFile function finished. 16/01/29 20:26:57 DEBUG HttpParser:
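One likely explanation, sketched below: on a multi-node YARN cluster a local-fs path is local to each machine, so executors write their part- files to the worker nodes' own disks, while the driver writes only the _SUCCESS marker on its machine, which matches the symptom. The path is hypothetical.

    // Each executor writes its part- files to ITS OWN /tmp/output;
    // the directory on the driver/client machine gets only _SUCCESS.
    rdd.saveAsTextFile("file:///tmp/output")

    // For small results, bring the data to the driver and write from there:
    // rdd.collect() returns everything to the driver process.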

mapWithState: multiple operations on the same stream

2016-01-29 Thread Udo Fholl
Hi, I have a stream from which I need to process events, send them to another service, and then remove the key from the state. I'm storing state because some events are delayed. My current approach is to consolidate the state and store it with a mapWithState invocation. Then rather than using a

Re: mapWithState: remove key

2016-01-29 Thread Shixiong(Ryan) Zhu
1. To remove a state, you need to call "state.remove()". If you return a None in the function, it just means don't output it as the DStream's output, but the state won't be removed if you don't call "state.remove()". 2. For NoSuchElementException, here is the doc for "State.get": /** * Get
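Putting both points together, a minimal Spark 1.6 sketch using the Option-returning overload (types and the expiry predicate are assumed): state.remove() drops the key, while returning None only suppresses output.

    import org.apache.spark.streaming.{State, StateSpec, Time}

    val spec = StateSpec.function(
      (time: Time, key: String, value: Option[Int], state: State[Int]) => {
        if (state.exists && isExpired(key)) {   // isExpired: hypothetical
          state.remove()                        // actually removes the key
          None                                  // and emits nothing for it
        } else {
          val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
          state.update(sum)
          Some((key, sum))
        }
      })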

mapWithState: remove key

2016-01-29 Thread Udo Fholl
Hi, From the signature of the "mapWithState" method I infer that by returning a "None.type" (in Scala) the key is removed from the state. Is that so? Sorry if it is in the docs, but it wasn't entirely clear to me. I'm chaining operations and calling "mapWithState" twice (one to consolidate,

Re: Spark 2.0.0 release plan

2016-01-29 Thread Mark Hamstra
https://github.com/apache/spark/pull/10608 On Fri, Jan 29, 2016 at 11:50 AM, Jakob Odersky wrote: > I'm not an authoritative source but I think it is indeed the plan to > move the default build to 2.11. > > See this discussion for more detail > >

Re: Getting Exceptions/WARN during random runs for same dataset

2016-01-29 Thread Shixiong(Ryan) Zhu
It's a known issue. See https://issues.apache.org/jira/browse/SPARK-10719 On Thu, Jan 28, 2016 at 5:44 PM, Khusro Siddiqui wrote: > It is happening on random executors on random nodes. Not on any specific > node everytime. > Or not happening at all > > On Thu, Jan 28, 2016 at

Re: GraphX can show graph?

2016-01-29 Thread Russell Jurney
Maybe check out Gephi. It is a program that does what you need out of the box. On Friday, January 29, 2016, Balachandar R.A. wrote: > Thanks... Will look into that > > - Bala > > On 28 January 2016 at 15:36, Sahil Sareen

How to control the number of files for dynamic partition in Spark SQL?

2016-01-29 Thread Benyi Wang
I want to insert into a partitioned table using dynamic partitioning, but I don't want to have 200 files for a partition because the files will be small in my case. sqlContext.sql( """ |insert overwrite table event |partition(eventDate) |select | user, | detail, | eventDate
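One approach, sketched below (table and source names assumed): the 200 files match the spark.sql.shuffle.partitions default of 200, so lowering that setting before the insert reduces the file count whenever a shuffle feeds the write.

    // Only effective when the insert's query actually shuffles
    // (e.g. a join, group by, or distribute by).
    sqlContext.setConf("spark.sql.shuffle.partitions", "16")
    sqlContext.sql(
      """insert overwrite table event
        |partition(eventDate)
        |select user, detail, eventDate
        |from staging_event
      """.stripMargin)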

Re: Spark 2.0.0 release plan

2016-01-29 Thread Jakob Odersky
I'm not an authoritative source but I think it is indeed the plan to move the default build to 2.11. See this discussion for more detail http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html On Fri, Jan 29, 2016 at 11:43 AM, Deenar Toraskar

Re: Spark 2.0.0 release plan

2016-01-29 Thread Deenar Toraskar
A related question: are there plans to move the default Spark builds to Scala 2.11 with Spark 2.0? Regards, Deenar On 27 January 2016 at 19:55, Michael Armbrust wrote: > We do maintenance releases on demand when there is enough to justify doing > one. I'm hoping to cut

Re: Spark 2.0.0 release plan

2016-01-29 Thread Michael Armbrust
Its already underway: https://github.com/apache/spark/pull/10608 On Fri, Jan 29, 2016 at 11:50 AM, Jakob Odersky wrote: > I'm not an authoritative source but I think it is indeed the plan to > move the default build to 2.11. > > See this discussion for more detail > >

Re: Spark streaming and ThreadLocal

2016-01-29 Thread Shixiong(Ryan) Zhu
Of course. If you use a ThreadLocal in a long-living thread and forget to remove it, it's definitely a memory leak. On Thu, Jan 28, 2016 at 9:31 PM, N B wrote: > Hello, > > Does anyone know if there are any potential pitfalls associated with using > ThreadLocal variables in

Re: Spark streaming and ThreadLocal

2016-01-29 Thread N B
Thanks for the response, Ryan. So I would say that that is in fact the purpose of a ThreadLocal, i.e. to have a copy of the variable for as long as the thread lives. I guess my concern is around the usage of thread pools and whether Spark Streaming will internally create many threads that rotate between tasks

Re: Spark streaming and ThreadLocal

2016-01-29 Thread Shixiong(Ryan) Zhu
Spark Streaming uses thread pools, so you need to remove the ThreadLocal when it's no longer used. On Fri, Jan 29, 2016 at 12:55 PM, N B wrote: > Thanks for the response, Ryan. So I would say that it is in fact the > purpose of a ThreadLocal, i.e. to have a copy of the variable for as long