spark-xml data source (com.databricks.spark.xml) not working with spark 1.6

2016-01-28 Thread Deenar Toraskar
Hi, has anyone tried using spark-xml with Spark 1.6? I cannot even get the sample books.xml file (wget https://github.com/databricks/spark-xml/raw/master/src/test/resources/books.xml ) working. https://github.com/databricks/spark-xml scala> val df =
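A minimal sketch of what the read might look like, assuming the spark-xml package is on the classpath and using the rowTag that matches the books.xml sample:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // sc is an existing SparkContext
    // treat each <book> element as one row
    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .load("books.xml")
    df.printSchema()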

Re: Having issue with Spark SQL JDBC on hive table !!!

2016-01-28 Thread @Sanjiv Singh
Any help on this. Regards Sanjiv Singh Mob : +091 9990-447-339 On Wed, Jan 27, 2016 at 10:25 PM, @Sanjiv Singh wrote: > Hi Ted , > Its typo. > > > Regards > Sanjiv Singh > Mob : +091 9990-447-339 > > On Wed, Jan 27, 2016 at 9:13 PM, Ted Yu wrote:

Re: Parquet block size from spark-sql cli

2016-01-28 Thread Ted Yu
Have you tried the following (sc is SparkContext)? sc.hadoopConfiguration.setInt("parquet.block.size", BLOCK_SIZE) On Thu, Jan 28, 2016 at 9:16 AM, ubet wrote: > Can I set the Parquet block size (parquet.block.size) in spark-sql. We are > loading about 80 table partitions
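For reference, a minimal sketch of that suggestion; the 256 MB value is only illustrative:

    // sc is the SparkContext; set the Parquet block size before writing
    val blockSize = 256 * 1024 * 1024
    sc.hadoopConfiguration.setInt("parquet.block.size", blockSize)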

Re: Spark integration with HCatalog (specifically regarding partitions)

2016-01-28 Thread Elliot West
Is this perhaps not currently supported? // TODO: Support persisting partitioned data source relations in Hive compatible format From: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L328 On 25 January 2016 at 19:45,

Re: Stream S3 server to Cassandra

2016-01-28 Thread Alexandr Dzhagriev
Hello Sateesh, I think you can use a file stream, e.g. streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory) to create a stream and then process the RDDs as you are doing now. Thanks, Alex. On Thu, Jan 28, 2016 at 10:56 AM, Sateesh Karuturi <
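A rough sketch of that idea, assuming text files; the S3 path is a placeholder and the Cassandra write is left as a comment:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // watch a directory for new files and process each batch as an RDD
    val files = streamingContext.fileStream[LongWritable, Text, TextInputFormat]("s3n://my-bucket/incoming/")
    files.map(_._2.toString).foreachRDD { rdd =>
      // reuse the existing Spark-core logic here, e.g. parse the XML and write to Cassandra
      rdd.foreach(line => println(line))
    }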

Understanding Spark Task failures

2016-01-28 Thread Patrick McGloin
I am trying to understand what will happen when Spark has an exception during processing, especially while streaming. If I have a small code snippet like this: myDStream.foreachRDD { (rdd: RDD[String]) => println(s"processed => [${rdd.collect().toList}]") throw new Exception("User

Parquet block size from spark-sql cli

2016-01-28 Thread ubet
Can I set the Parquet block size (parquet.block.size) in spark-sql? We are loading about 80 table partitions in parallel on 1.5.2 and run into OOM.

Re: Having issue with Spark SQL JDBC on hive table !!!

2016-01-28 Thread @Sanjiv Singh
Adding to it, job status at the UI:
Stage Id | Description | Submitted | Duration | Tasks: Succeeded/Total | Input | Output | Shuffle Read | Shuffle Write
1 | select ename from employeetest (kill) collect at SparkPlan.scala:84

Re: Spark Distribution of Small Dataset

2016-01-28 Thread Kevin Mellott
Hi Phil, The short answer is that there is a driver machine (which handles the distribution of tasks and data) and a number of worker nodes (which receive data and perform the actual tasks). That being said, certain tasks need to be performed on the driver, because they require all of the data.

Re: Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

2016-01-28 Thread Jonathan Kelly
Just FYI, Spark 1.6 was released on emr-4.3.0 a couple days ago: https://aws.amazon.com/blogs/aws/emr-4-3-0-new-updated-applications-command-line-export/ On Thu, Jan 28, 2016 at 7:30 PM Andrew Zurn wrote: > Hey Daniel, > > Thanks for the response. > > After playing around for a

Repartition taking place for all previous windows even after checkpointing

2016-01-28 Thread Abhishek Anand
Hi All, Can someone help me with the following doubts regarding checkpointing? My code flow is something like follows ->
1) create direct stream from kafka
2) repartition kafka stream
3) mapToPair followed by reduceByKey
4) filter
5) reduceByKeyAndWindow without the inverse function
6)
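For context, a rough sketch of such a flow; topic names, window sizes, and the functions are placeholders rather than the poster's actual code:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker:9092")   // placeholder broker
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))                               // 1) direct stream from Kafka

    val windowedCounts = stream
      .repartition(32)                                               // 2) repartition
      .map { case (_, v) => (v, 1L) }.reduceByKey(_ + _)             // 3) mapToPair + reduceByKey
      .filter { case (_, c) => c > 1 }                               // 4) filter
      .reduceByKeyAndWindow((a: Long, b: Long) => a + b,
        Seconds(300), Seconds(10))                                   // 5) window, no inverse function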

Re: Databricks Cloud vs AWS EMR

2016-01-28 Thread Eran Witkon
Can you name the features that make Databricks better than Zeppelin? Eran On Fri, 29 Jan 2016 at 01:37 Michal Klos wrote: > We use both databricks and emr. We use databricks for our exploratory / > adhoc use cases because their notebook is pretty badass and better than >

Re: local class incompatible: stream classdesc serialVersionUID

2016-01-28 Thread Ted Yu
I am not a Scala expert. RDD extends Serializable but doesn't have the @SerialVersionUID() annotation. This may explain what you described. One approach is to add @SerialVersionUID so that RDDs have a stable serial version UID. Cheers On Thu, Jan 28, 2016 at 1:38 PM, Jason Plurad
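A minimal sketch of that suggestion, applied to a user-defined serializable class (MyRecord is a placeholder):

    // pin the serial version UID so recompilation does not change it
    @SerialVersionUID(100L)
    class MyRecord(val id: Long, val name: String) extends Serializable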

Re: Databricks Cloud vs AWS EMR

2016-01-28 Thread Rakesh Soni
> > At its core, EMR just launches Spark applications, whereas Databricks is a > higher-level platform that also includes multi-user support, an interactive > UI, security, and job scheduling. > > Specifically, Databricks runs standard Spark applications inside a user’s > AWS account, similar to

Re: How to write a custom window function?

2016-01-28 Thread Benyi Wang
Never mind. GenericUDAFCollectList supports struct in 1.3.0. I modified it and it works in a tricky way. I also found an example HiveWindowFunction. On Thu, Jan 28, 2016 at 12:49 PM, Benyi Wang wrote: > I'm trying to implement a WindowFunction like collect_list, but I

Re: JSON to SQL

2016-01-28 Thread Andrés Ivaldi
Thanks for the tip, I had realized that and ended up using explode as you said. This is my attempt: var res=(df.explode("rows","r") { l: WrappedArray[ArrayBuffer[String]] => l.toList}).select("r") .map { m => m.getList[Row](0) } var u = res.map { m => Row.fromSeq(m.toSeq)

Re: Spark, Mesos, Docker and S3

2016-01-28 Thread Sathish Kumaran Vairavelu
Thank you, I figured it out. I set the executor memory to the minimum and it works. Another issue has come up: I have to pass the --add-host option while running containers on slave nodes. Is there any option to pass docker run parameters from Spark? On Thu, Jan 28, 2016 at 12:26 PM Mao Geng

Re: spark-xml data source (com.databricks.spark.xml) not working with spark 1.6

2016-01-28 Thread Andrés Ivaldi
Hi, could you get it to work? Tomorrow I'll be using the XML parser as well, on Windows 7; I'll let you know the results. Regards, On Thu, Jan 28, 2016 at 12:27 PM, Deenar Toraskar wrote: > Hi > > Anyone tried using spark-xml with spark 1.6. I cannot even get the sample

RE: JSON to SQL

2016-01-28 Thread Mohammed Guller
You don’t need Hive for that. The DataFrame class has a method named explode, which provides the same functionality. Here is an example from the Spark API documentation: df.explode("words", "word"){words: String => words.split(" ")} The first argument to the explode method is the name of the
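A slightly fuller, hedged version of that snippet, assuming a DataFrame df with a String column named "words":

    // each input row is expanded into one output row per word
    val exploded = df.explode("words", "word") { words: String => words.split(" ").toSeq }
    exploded.select("word").show()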

Re: Databricks Cloud vs AWS EMR

2016-01-28 Thread Sourav Mazumder
You can also try out IBM's Spark as a service in IBM Bluemix. You'll get all the required features there for security, multitenancy, notebooks, and integration with other big data services. You can try it out for free too. Regards, Sourav On Thu, Jan 28, 2016 at 2:10 PM, Rakesh Soni

streaming in 1.6.0 slower than 1.5.1

2016-01-28 Thread Jesse F Chen
I ran the same streaming application (compiled individually for 1.5.1 and 1.6.0) that processes 5-second tweet batches. I noticed two things:
1. 10% regression in 1.6.0 vs 1.5.1 (Spark v1.6.0: 1,564 tweets/s; Spark v1.5.1: 1,747 tweets/s)
2. 1.6.0 streaming seems to have a memory

Re: Understanding Spark Task failures

2016-01-28 Thread Tathagata Das
That is hard for the system to guarantee, and it is up to the app developer to ensure that this does not happen. For example, if the data in a message is corrupted, unless the app code is robust towards handling such data, the system will fail every time it retries that app code. On Thu, Jan 28, 2016 at

Problems when applying scheme to RDD

2016-01-28 Thread Andrés Ivaldi
Hello, I'm getting an exception when trying to apply a new schema to an RDD. I'm reading JSON with Databricks spark-csv v1.3.0; after applying some transformations I have an RDD with String-typed columns. Then I try to apply a schema where one of the fields is Integer, and this exception is raised
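The usual pattern for this, sketched with placeholder column names: the values inside each Row must already match the declared types (e.g. the string parsed to Int) before createDataFrame is called, otherwise the mismatch surfaces at runtime:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))

    // convert the string column to Int while building the rows, not only in the schema
    // (stringRDD is a placeholder RDD[(String, String)])
    val rowRDD = stringRDD.map { case (name, ageStr) => Row(name, ageStr.toInt) }
    val df = sqlContext.createDataFrame(rowRDD, schema)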

Re: Broadcast join on multiple dataframes

2016-01-28 Thread Michael Armbrust
Can you provide the analyzed and optimized plans (explain(true)) On Thu, Jan 28, 2016 at 12:26 PM, Srikanth wrote: > Hello, > > I have a use case where one large table has to be joined with several > smaller tables. > I've added broadcast hint for all small tables in the

local class incompatible: stream classdesc serialVersionUID

2016-01-28 Thread Jason Plurad
I've searched through the mailing list archive. It seems that if you try to run, for example, a Spark 1.5.2 program against a Spark 1.5.1 standalone server, you will run into an exception like this: WARN org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0,

Re: streaming in 1.6.0 slower than 1.5.1

2016-01-28 Thread Ted Yu
bq. The total size by class B is 3GB in 1.5.1 and only 60MB in 1.6.0. From the information you posted, it seems the above is backwards. BTW [B is byte[], not class B. FYI On Thu, Jan 28, 2016 at 11:49 AM, Jesse F Chen wrote: > I ran the same streaming application

Re: streaming in 1.6.0 slower than 1.5.1

2016-01-28 Thread Shixiong(Ryan) Zhu
Hey Jesse, Could you provide the operators you are using? For the heap dump, it may not be a real memory leak. Since batches started to queue up, the memory usage should increase. On Thu, Jan 28, 2016 at 11:54 AM, Ted Yu wrote: > bq. The total size by class B is 3GB in 1.5.1

Broadcast join on multiple dataframes

2016-01-28 Thread Srikanth
Hello, I have a use case where one large table has to be joined with several smaller tables. I've added broadcast hint for all small tables in the joins. val largeTableDF = sqlContext.read.format("com.databricks.spark.csv") val metaActionDF = sqlContext.read.format("json") val
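For reference, one way to make the hint explicit per join, using the broadcast function; the join column is a placeholder:

    import org.apache.spark.sql.functions.broadcast

    // force a broadcast of the small side of the join and inspect the resulting plan
    val joined = largeTableDF.join(broadcast(metaActionDF), Seq("action_id"))
    joined.explain(true)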

How to write a custom window function?

2016-01-28 Thread Benyi Wang
I'm trying to implement a WindowFunction like collect_list, but I have to collect a struct. collect_list works only for primitive type. I think I might modify GenericUDAFCollectList, but haven't tried it yet. I'm wondering if there is an example showing how to write a custom WindowFunction in

Re: Understanding Spark Task failures

2016-01-28 Thread Patrick McGloin
Hi Tathagata, Thanks for the response. I can add in a try catch myself and handle user exceptions, that's true, so maybe my example wasn't a very good one. I'm more worried about OOM exceptions and other run-time exceptions (that could happen outside my try catch). For example, I have this

Re: Databricks Cloud vs AWS EMR

2016-01-28 Thread Michal Klos
We use both databricks and emr. We use databricks for our exploratory / adhoc use cases because their notebook is pretty badass and better than Zeppelin IMHO. We use EMR for our production machine learning and ETL tasks. The nice thing about EMR is you can use applications other than spark.

building spark 1.6.0 fails

2016-01-28 Thread Carlile, Ken
I am attempting to build Spark 1.6.0 from source on EL 6.3, using Oracle JDK 1.8.0.45, Python 2.7.6, and Scala 2.10.3. When I try to issue build/mvn -DskipTests clean package, I get the following: [INFO] Using zinc server for incremental compilation [info] Compiling 3 Java sources to

Spark Caching Kafka Metadata

2016-01-28 Thread asdf zxcv
Does Spark cache which kafka topics exist? A service incorrectly assumes all the relevant topics exist, even if they are empty, causing it to fail. Fortunately the service is automatically restarted and by default, kafka creates the topic after it is requested. I'm trying to create the topic if

Data not getting printed in Spark Streaming with print().

2016-01-28 Thread satyajit vegesna
Hi All, I am trying to run the HdfsWordCount example from GitHub. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala I am using Ubuntu to run the program, but don't see any data getting printed after,

Re: Data not getting printed in Spark Streaming with print().

2016-01-28 Thread Shixiong(Ryan) Zhu
fileStream has a parameter "newFilesOnly". By default, it's true and means processing only new files and ignore existing files in the directory. So you need to ***move*** the files into the directory, otherwise it will ignore existing files. You can also set "newFilesOnly" to false. Then in the
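For reference, a sketch of the overload with the flag set to false; the directory and the accept-everything filter are placeholders:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // newFilesOnly = false also picks up files already present in the directory
    val stream = streamingContext.fileStream[LongWritable, Text, TextInputFormat](
      "/data/incoming", (path: Path) => true, newFilesOnly = false)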

Re: Spark, Mesos, Docker and S3

2016-01-28 Thread Mao Geng
From my limited knowledge, only limited options such as network mode, volumes, portmaps can be passed through. See https://github.com/apache/spark/pull/3074/files. https://issues.apache.org/jira/browse/SPARK-8734 is open for exposing all docker options to spark. -Mao On Thu, Jan 28, 2016 at

Getting Exceptions/WARN during random runs for same dataset

2016-01-28 Thread Khusro Siddiqui
Hi Everyone, Environment used: Datastax Enterprise 4.8.3 which is bundled with Spark 1.4.1 and scala 2.10.5. I am using Dataframes to query Cassandra, do processing and store the result back into Cassandra. The job is being submitted using spark-submit on a cluster of 3 nodes. While doing so I

Re: building spark 1.6.0 fails

2016-01-28 Thread Ted Yu
I tried the following command: build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.4 -Dhadoop.version=2.7.0 package -DskipTests I didn't encounter the error you mentioned. bq. Using zinc server for incremental compilation Was it possible that zinc was running before you started the

Re: Getting Exceptions/WARN during random runs for same dataset

2016-01-28 Thread Khusro Siddiqui
It is happening on random executors on random nodes, not on any specific node every time. Or not happening at all. On Thu, Jan 28, 2016 at 7:42 PM, Ted Yu wrote: > Did the UnsupportedOperationException's happen from the executors on all the > nodes or only one node ? > >

Spark Distribution of Small Dataset

2016-01-28 Thread Philip Lee
Hi, Simple question about Spark distribution of a small dataset. Let's say I have 8 machines with 48 cores and 48GB of RAM as a cluster. The dataset (ORC format, written by Hive) is quite small, around 1GB, but I copied it to HDFS. 1) if spark-sql runs on the dataset distributed on HDFS on each machine, what happens

Re: Why does Spark-sql miss the TableScanDesc.FILTER_EXPR_CONF_STR params when I move a Hive table to Spark?

2016-01-28 Thread ????????
If we support TableScanDesc.FILTER_EXPR_CONF_STR like Hive does, we may write SQL like this:
select ydb_sex from ydb_example_shu where ydbpartion='20151110' limit 10
select ydb_sex from ydb_example_shu where ydbpartion='20151110' and (ydb_sex='??' or ydb_province='' or ydb_day>='20151217') limit

Re: Why does Spark-sql miss the TableScanDesc.FILTER_EXPR_CONF_STR params when I move a Hive table to Spark?

2016-01-28 Thread ????????
We always use SQL like the following: select count(*) from ydb_example_shu where ydbpartion='20151110' and (ydb_sex='' or ydb_province='LIAONING' or ydb_day>='20151217') limit 10. Spark doesn't push down predicates for TableScanDesc.FILTER_EXPR_CONF_STR, which means that every query is a full scan and can't

Re: can't find trackStateByKey in 1.6.0 jar?

2016-01-28 Thread Sebastian Piu
That explains it! Thanks :) On Thu, Jan 28, 2016 at 9:52 AM, Tathagata Das wrote: > its been renamed to mapWithState when 1.6.0 was released. :) > > > > On Thu, Jan 28, 2016 at 1:51 AM, Sebastian Piu > wrote: > >> I wanted to give the new

Re: bug for large textfiles on windows

2016-01-28 Thread Christopher Bourez
Dears, I recompiled Spark on Windows; it seems to work better. My problem with PySpark remains: https://issues.apache.org/jira/browse/SPARK-12261 I do not know how to debug this; it seems to be linked with Pickle, the garbage collector... I would like to clear the Spark context to see if I can gain

Re: can't find trackStateByKey in 1.6.0 jar?

2016-01-28 Thread Tathagata Das
It's been renamed to mapWithState when 1.6.0 was released. :) On Thu, Jan 28, 2016 at 1:51 AM, Sebastian Piu wrote: > I wanted to give the new trackStateByKey method a try, but I'm missing > something very obvious here as I can't see it on the 1.6.0 jar. Is there >
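A minimal sketch of the renamed API; the state type and update logic are illustrative only, and pairs is assumed to be a DStream[(String, Int)]:

    import org.apache.spark.streaming.{State, StateSpec}

    // keep a running count per key
    val spec = StateSpec.function { (key: String, value: Option[Int], state: State[Long]) =>
      val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
      state.update(newCount)
      (key, newCount)
    }
    val stateStream = pairs.mapWithState(spec)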

Re: GraphX can show graph?

2016-01-28 Thread Sahil Sareen
Try Neo4j for visualization; GraphX does a pretty good job at distributed graph processing. On Thu, Jan 28, 2016 at 12:42 PM, Balachandar R.A. wrote: > Hi > > I am new to GraphX. I have a simple csv file which I could load and > compute few graph statistics. However, I

can't find trackStateByKey in 1.6.0 jar?

2016-01-28 Thread Sebastian Piu
I wanted to give the new trackStateByKey method a try, but I'm missing something very obvious here as I can't see it in the 1.6.0 jar. Is there anything in particular I have to do, or is it just Maven playing tricks with me? This is the dependency I'm using: org.apache.spark spark-streaming_2.10

“java.io.IOException: Class not found” on long running Streaming application

2016-01-28 Thread Patrick McGloin
I am getting the exception below on a long running Spark Streaming application. The exception could occur after a few minutes, but it also may not happen for days. This is with pretty consistent input data. I have seen this Jira ticket (

Stream S3 server to Cassandra

2016-01-28 Thread Sateesh Karuturi
Hello, can anyone please help me with how to stream XML files from an S3 server to a Cassandra DB using Spark Streaming (Java)? Presently I am using Spark Core to do that job, but the problem is I have to run it every 15 minutes; that's why I am looking at Spark Streaming.

Explaination for info shown in UI

2016-01-28 Thread Sachin Aggarwal
Hi, I am executing a streaming wordcount with Kafka, with one test topic with 2 partitions. My cluster has three Spark executors. Each batch is of 10 sec. For every batch (e.g. below, batch time 02:51:00) I see 3 entries in the Spark UI, as shown below. My questions: 1) As the label says jobId for

Why does Spark-sql miss the TableScanDesc.FILTER_EXPR_CONF_STR params when I move a Hive table to Spark?

2016-01-28 Thread ????????
Dear Spark, I am testing a StorageHandler on Spark-SQL, but I find that TableScanDesc.FILTER_EXPR_CONF_STR is missing, and I need it. Is there anywhere I could find it? I really want to get some filter information from Spark SQL, so that I could do a pre-filter with my index; so where is the

Re: Spark streaming flow control and back pressure

2016-01-28 Thread Lin Zhao
I'm using branch-1.6 built for 2.11 yesterday. Below is the part of my actor receiver that stores data. The log reports millions, while the job is apparently back-pressured according to the UI (i.e. 2000 per 10s batch). store((key, msg)) if (storeCount.incrementAndGet() % 10 == 0) { logger.info(s"Stored

Streaming: LeaseExpiredException when writing checkpoint

2016-01-28 Thread Lin Zhao
I'm seeing this error in the driver when running a streaming job. Not sure if it's critical. It happens maybe half of the time a checkpoint is saved. There are retries in the log, but it sometimes results in "Could not write checkpoint for time 145400632 ms to file

Re: Spark, Mesos, Docker and S3

2016-01-28 Thread Mao Geng
Sathish, I guess the mesos resources are not enough to run your job. You might want to check the mesos log to figure out why. I tried to run the docker image with "--conf spark.mesos.coarse=false" and "true". Both are fine. Best, Mao On Wed, Jan 27, 2016 at 5:00 PM, Sathish Kumaran Vairavelu <

Re: spark.kryo.classesToRegister

2016-01-28 Thread Jagrut Sharma
I have run into this issue ( https://issues.apache.org/jira/browse/SPARK-10251) with kryo on Spark version 1.4.1. Just something to be aware of when setting config to 'true'. Thanks. -- Jagrut On Thu, Jan 28, 2016 at 6:32 AM, Jim Lohse wrote: > You are only

Setting up data for columnsimilarity

2016-01-28 Thread rcollich
Hi all, I need to be able to find the cosine similarity of a series of vectors (for the sake of argument let's say that every vector is a tweet). However, I'm having an issue with how to actually prepare my data to use the columnSimilarities function. I'm receiving these vectors in row format

RE: Python UDFs

2016-01-28 Thread Stefan Panayotov
Thanks, Jacob. But it seems that Python requires the RETURN Type to be specified. And DenseVector is not a valid return type, or I do not know the correct type to put in. Shall I try ArrayType? Any ideas? Stefan Panayotov, PhD Home: 610-355-0919 Cell: 610-517-5586 email:

Programmatically launching spark on yarn-client mode no longer works in spark 1.5.2

2016-01-28 Thread Nirav Patel
Hi, we were using spark 1.3.1 and launching our spark jobs on yarn-client mode programmatically via creating a sparkConf and sparkContext object manually. It was inspired from spark self-contained application example here:

Re: Programmatically launching spark on yarn-client mode no longer works in spark 1.5.2

2016-01-28 Thread Nirav Patel
Thanks Saisai. I saw the following in the YARN container logs. I think that killed the SparkContext.
16/01/28 17:38:29 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
Unknown/unsupported param List(--properties-file,

Persisting of DataFrames in transformation workflows

2016-01-28 Thread Gireesh Puthumana
Hi All, I am trying to run a series of transformations over 3 DataFrames. After each transformation, I want to persist the DF and save it to a text file. The steps I am doing are as follows.
*Step0:* Create DF1, Create DF2, Create DF3, Create DF4 (no persist, no save yet)
*Step1:* Create RESULT-DF1 by

Re: Programmatically launching spark on yarn-client mode no longer works in spark 1.5.2

2016-01-28 Thread Saisai Shao
Sorry I didn't notice this mail, seems like a wrong cmdline problem, please ignore my previous comment. On Fri, Jan 29, 2016 at 11:58 AM, Nirav Patel wrote: > Thanks Saisai. I saw following in yarn container logs. I think that killed > sparkcontext. > > 16/01/28 17:38:29

How to filter the isolated vertexes in Graphx

2016-01-28 Thread Zhang, Jingyu
I am trying to filter out vertices that do not have any links to other vertices. How can I filter those isolated vertices in GraphX? Thanks, Jingyu
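One way to do it, sketched for a generic Graph[VD, ED]: join in the vertex degrees (isolated vertices have no entry in graph.degrees) and keep only vertices whose degree is above zero.

    // attach the degree to each vertex, defaulting to 0 for isolated ones,
    // then keep only the connected part of the graph
    val withDegrees = graph.outerJoinVertices(graph.degrees) {
      (id, attr, deg) => (attr, deg.getOrElse(0))
    }
    val connected = withDegrees
      .subgraph(vpred = (id, attr) => attr._2 > 0)
      .mapVertices((id, attr) => attr._1)   // drop the degree again, back to the original attribute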

Re: looking for an easy way to count number of rows in JavaDStream

2016-01-28 Thread Andy Davidson
Forgot to mention: the reason I want the count is so that I can repartition my data so that when I save it to disk each file has about 100 rows instead of lots of smaller files. Kind regards, Andy From: Andrew Davidson Date: Thursday, January 28, 2016 at 6:41 PM To:

Spark 1.5.2 - Programmatically launching spark on yarn-client mode

2016-01-28 Thread Nirav Patel
Hi, we were using spark 1.3.1 and launching our spark jobs on yarn-client mode programmatically via creating a sparkConf and sparkContext object manually. It was inspired from spark self-contained application example here:

Re: Programmatically launching spark on yarn-client mode no longer works in spark 1.5.2

2016-01-28 Thread Saisai Shao
I think I met this problem before, this problem might be due to some race conditions in exit period. The way you mentioned is still valid, this problem only occurs when stopping the application. Thanks Saisai On Fri, Jan 29, 2016 at 10:22 AM, Nirav Patel wrote: > Hi, we

Re: Getting Exceptions/WARN during random runs for same dataset

2016-01-28 Thread Ted Yu
Did the UnsupportedOperationException's happen from the executors on all the nodes or only one node ? Thanks On Thu, Jan 28, 2016 at 5:13 PM, Khusro Siddiqui wrote: > Hi Everyone, > > Environment used: Datastax Enterprise 4.8.3 which is bundled with Spark > 1.4.1 and scala

looking for an easy way to count number of rows in JavaDStream

2016-01-28 Thread Andy Davidson
There must be an easy way to count the number of rows in a JavaDStream. JavaDStream words; JavaDStream hardToUse = words(); JavaDStream does not seem to have a collect(). The following works but is very clumsy. Any suggestions would be greatly appreciated. Andy public class

Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-28 Thread Calvin Jia
Hi, Thanks for the detailed information. How large is the dataset you are running against? Also did you change any Tachyon configurations? Thanks, Calvin

Re: Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

2016-01-28 Thread Andrew Zurn
Hey Daniel, Thanks for the response. After playing around for a bit, it looks like it's probably something similar to the first situation you mentioned, with the Parquet format causing issues. Both a programmatically created dataset and a dataset pulled off the internet (rather than out of S3

Re: Spark 1.5.2 - Programmatically launching spark on yarn-client mode

2016-01-28 Thread Ted Yu
Looks like '--properties-file' is no longer supported. Was it possible that Spark 1.3.1 artifact / dependency leaked into your app ? Cheers On Thu, Jan 28, 2016 at 7:36 PM, Nirav Patel wrote: > Hi, we were using spark 1.3.1 and launching our spark jobs on yarn-client >

Re: Spark streaming flow control and back pressure

2016-01-28 Thread Iulian Dragoș
Calling `store` should get you there. What version of Spark are you using? Can you share your code? iulian On Thu, Jan 28, 2016 at 2:28 AM, Lin Zhao wrote: > I have an actor receiver that reads data and calls "store()" to save data > to spark. I was hoping

Re: Hive on Spark knobs

2016-01-28 Thread Todd
Did you run hive on spark with spark 1.5 and hive 1.1? I think hive on spark doesn't support spark 1.5. There are compatibility issues. At 2016-01-28 01:51:43, "Ruslan Dautkhanov" wrote: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

Re: spark.kryo.classesToRegister

2016-01-28 Thread Jim Lohse
You are only required to add classes to Kryo (compulsorily) if you use a specific setting: //require registration of all classes with Kryo .set("spark.kryo.registrationRequired","true") Here's an example of my setup, I think this is the best approach because it forces me to really think
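A minimal sketch of that setup; the class names are placeholders:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")   // fail fast on unregistered classes
      .registerKryoClasses(Array(classOf[MyRecord], classOf[Array[MyRecord]]))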

Re: Tips for Spark's Random Forest slow performance

2016-01-28 Thread Alexander Ratnikov
Coming back to this I believe I found some reasons. Basically, the main logic sits inside ProbabilisticClassificationModel. It has a transform method which takes a DataFrame (the vector to classify) and appends to it some UDFs which actually do the prediction. The thing is that this DataFrame

Re: Having issue with Spark SQL JDBC on hive table !!!

2016-01-28 Thread @Sanjiv Singh
It is working now... I checked the Spark worker UI; executor startup was failing with the error below, JVM initialization failing because of a wrong -Xms:
Invalid initial heap size: -Xms0M
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Thrift

Spark streaming and ThreadLocal

2016-01-28 Thread N B
Hello, Does anyone know if there are any potential pitfalls associated with using ThreadLocal variables in a Spark streaming application? One thing I have seen mentioned in the context of app servers that use thread pools is that ThreadLocals can leak memory. Could this happen in Spark streaming

Visualization of KMeans cluster in Spark

2016-01-28 Thread Yogesh Vyas
Hi, Is there any way of visualizing the KMeans clusters in Spark? Can we connect Plotly with Apache Spark in Java? Thanks, Yogesh