Re: PMML version in MLLib

2015-11-08 Thread Vincenzo Selvaggio
Hi, I confirm the models are exported for PMML version 4.2; in fact you can see in the generated XML: PMML xmlns="http://www.dmg.org/PMML-4_2". This is the default version when using https://github.com/jpmml/jpmml-model/tree/1.1.X. I didn't realize the attribute version of the PMML root element
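For anyone who wants to verify this, a minimal sketch (assuming a spark-shell session and a KMeansModel, which mixes in PMMLExportable as of Spark 1.4) that prints the generated document so the root element can be inspected:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // toy data, purely illustrative
    val data = sc.parallelize(Seq(Vectors.dense(0.0), Vectors.dense(1.0), Vectors.dense(9.0)))
    val model = KMeans.train(data, 2, 10)

    // toPMML() returns the PMML document as a String; the root element
    // carries the PMML-4_2 namespace when built against jpmml-model 1.1.x
    println(model.toPMML())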

Re: Spark Streaming updateStateByKey Implementation

2015-11-08 Thread Zoltán Zvara
It is implemented with cogroup. Basically it stores state in a separate RDD and cogroups the target RDD with the state RDD, which is then hidden from you. See StateDStream.scala; there is everything you need to know. On Fri, Nov 6, 2015 at 6:25 PM Hien Luu wrote: > Hi, > > I
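To make the mechanism concrete, here is a rough sketch of the idea on plain RDDs (not the actual StateDStream code; the names and the summing update function are invented for illustration):

    // toy state and batch RDDs
    val state = sc.parallelize(Seq(("a", 3)))
    val batch = sc.parallelize(Seq(("a", 1), ("b", 2)))

    // cogroup lines up the old state and the new values per key,
    // then an update function folds them into the new state
    val newState = state.cogroup(batch).mapValues {
      case (oldCounts, newValues) => oldCounts.sum + newValues.sum
    }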

Re: streaming: missing data. does saveAsTextFile() append or replace?

2015-11-08 Thread Gerard Maas
Andy, Using the rdd.saveAsTextFile(...) will overwrite the data if your target is the same file. If you want to save to HDFS, DStream offers dstream.saveAsTextFiles(prefix, suffix) where a new file will be written at each streaming interval. Note that this will result in a saved file for each
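For illustration, a one-line sketch of that API (the dstream and the HDFS path are hypothetical):

    // writes one directory per batch interval, named <prefix>-<batchTimeMs>.<suffix>
    dstream.saveAsTextFiles("hdfs:///streaming/out/part", "txt")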

Re: Spark Job failing with exit status 15

2015-11-08 Thread Ted Yu
Which release of Spark were you using? Can you post the command you used to run WordCount? Cheers On Sat, Nov 7, 2015 at 7:59 AM, Shashi Vishwakarma wrote: > I am trying to run simple word count job in spark but I am getting > exception while running job. > > For

Connecting SparkR through Yarn

2015-11-08 Thread Amit Behera
Hi All, Spark version = 1.5.1, Hadoop version = 2.6.0. I set up the cluster on Amazon EC2 machines (1+5). I am able to create a SparkContext object using the *init* method from *RStudio*, but I do not know how I can create a SparkContext object in *yarn mode*. I got the below link to run on yarn, but in

Re: Spark Job failing with exit status 15

2015-11-08 Thread Shashi Vishwakarma
Hi, I am using Spark 1.3.0. The command that I use is below. /spark-submit --class org.com.td.sparkdemo.spark.WordCount \ --master yarn-cluster \ target/spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar Thanks Shashi On Sun, Nov 8, 2015 at 11:33 PM, Ted Yu wrote: >

Broadcast Variables not showing inside Partitions Apache Spark

2015-11-08 Thread prajwol sangat
Hi All, I am facing a weird situation which is explained below. Scenario and problem: I want to add two attributes to a JSON object based on the lookup table values and insert the JSON into MongoDB. I have a broadcast variable which holds the lookup table. However, I am not able to access it
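For comparison, a minimal sketch of the pattern that usually works (the lookup data is invented; the key point is reading .value inside the task closure):

    val lookupTable = sc.broadcast(Map("id1" -> "attrA", "id2" -> "attrB"))
    val ids = sc.parallelize(Seq("id1", "id2", "id3"))

    val enriched = ids.mapPartitions { iter =>
      // read .value inside the closure; the underlying map is fetched
      // once per executor, not once per record
      val table = lookupTable.value
      iter.map(id => (id, table.getOrElse(id, "unknown")))
    }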

Re: Scheduling Spark process

2015-11-08 Thread Hitoshi Ozawa
I'm not following your question about scheduling. Did you create a Spark application, and are you asking how to schedule it to run? Are you going to output results from the scheduled run in HDFS and join them in the first chain with the real-time result?

Re: Does the Standalone cluster and Applications need to be same Spark version?

2015-11-08 Thread Hitoshi Ozawa
I think it depends on the versions. Using something like 0.9.2 and 1.5.1 isn't recommended. 1.5.1 is a minor bug-fix release of 1.5.0, so I think most things will work, but some features may behave differently, so it's better to use the same revision. Changes between versions/releases are listed in CHANGES.txt

How to use --principal and --keytab in SparkSubmit

2015-11-08 Thread Todd
Hi, I am starting the Spark thrift server with the following script: ./start-thriftserver.sh --master yarn-client --driver-memory 1G --executor-memory 2G --driver-cores 2 --executor-cores 2 --num-executors 4 --hiveconf hive.server2.thrift.port=10001 --hiveconf

passing RDDs/DataFrames as arguments to functions - what happens?

2015-11-08 Thread Kristina Rogale Plazonic
Hi, I thought I understood RDDs and DataFrames, but one noob thing is bugging me (because I'm seeing weird errors involving joins): *What does Spark do when you pass a big dataframe as an argument to a function?* Are these dataframes included in the closure of the function, and is therefore
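A sketch of the key point as I understand it (the DataFrames here are assumed to exist): a DataFrame is a handle to a lazily evaluated plan, so passing one to a function passes a reference, not the data:

    import org.apache.spark.sql.DataFrame

    // nothing is shipped or computed when the function is called;
    // the join only runs when an action is invoked on the result
    def withLookup(big: DataFrame, lookup: DataFrame): DataFrame =
      big.join(lookup, "id")

    val result = withLookup(bigDf, lookupDf) // still lazy
    result.count()                           // the join executes here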

Re: visualizations using the apache spark

2015-11-08 Thread Hitoshi Ozawa
You can save the result to a storage system (e.g. Hive) and have a web application read data from that. I think there's also a "toJSON" method to convert a Dataset to JSON. Another option is to use something like Spark Kernel, with its Spark sc (https://github.com/ibm-et/spark-kernel/wiki). Another choice is to
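For the "toJSON" route, a minimal sketch (in Spark 1.x toJSON returns an RDD[String] with one JSON document per row; the DataFrame here is assumed to exist):

    // each element is one row rendered as a JSON string
    val jsonLines = resultDf.toJSON.collect()
    jsonLines.foreach(println)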

Clustering of Words

2015-11-08 Thread Deep Pradhan
Hi, I am trying to cluster words of some articles. I used TFIDF and Word2Vec in Spark to get the vector for each word and I used KMeans to cluster the words. Now, is there any way to get back the words from the vectors? I want to know what words are there in each cluster. I am aware that TFIDF
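One hedged approach for the Word2Vec + KMeans combination: keep the (word, vector) pairs from the Word2VecModel and run the KMeansModel's predict on each vector, so the word travels alongside its cluster id (a sketch; both trained models are assumed to exist):

    import org.apache.spark.mllib.linalg.Vectors

    // getVectors returns Map[String, Array[Float]]
    val assignments = word2vecModel.getVectors.toSeq.map { case (word, arr) =>
      (kmeansModel.predict(Vectors.dense(arr.map(_.toDouble))), word)
    }

    assignments.groupBy(_._1).foreach { case (cluster, words) =>
      println(s"cluster $cluster: ${words.map(_._2).mkString(", ")}")
    }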

OLAP query using spark dataframe with cassandra

2015-11-08 Thread fightf...@163.com
Hi, community. We are especially interested in this feature integration according to some slides from [1]. The SMACK stack (Spark+Mesos+Akka+Cassandra+Kafka) seems a good implementation of the lambda architecture in the open-source world, especially for non-Hadoop-based cluster environments. As we can see,
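For reference, a minimal sketch of querying Cassandra through DataFrames with the spark-cassandra-connector (the keyspace and table names are made up, and sqlContext is assumed to exist):

    val events = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "analytics", "table" -> "events"))
      .load()

    events.groupBy("event_type").count().show()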

Re: Is SPARK is the right choice for traditional OLAP query processing?

2015-11-08 Thread Hitoshi Ozawa
It depends on how much data needs to be processed. A data warehouse with indexes is going to be faster when there is not much data. If you have big data, Spark Streaming and maybe Spark SQL may interest you.

Re: why prebuild spark 1.5.1 still say Failed to find Spark assembly in

2015-11-08 Thread Hitoshi Ozawa
Are you sure you downloaded the pre-built version? The default download is the source package. Please check that the name of the file you've downloaded starts with "spark-1.5.1-bin-", with a "bin".

Re: PMML version in MLLib

2015-11-08 Thread Fazlan Nazeem
Hi Vincenzo/Owen, I have sent a pull request [1] with the necessary changes to add the PMML version attribute to the root node. I have also linked the issue under the PMML improvement umbrella [2] as you suggested. [1] https://github.com/apache/spark/pull/9558 [2]

Re: OLAP query using spark dataframe with cassandra

2015-11-08 Thread Jörn Franke
Is there any distributor supporting these software components in combination? If not, and your core business is not software, then you may want to look for something else, because it might not make sense to build up internal know-how in all of these areas. In any case, it all depends highly on

Unwanted SysOuts in Spark Parquet

2015-11-08 Thread swetha
Hi, I see a lot of unwanted SysOuts when I try to save an RDD as a Parquet file. Following is the code and the SysOuts. Any idea as to how to avoid the unwanted SysOuts? ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport]) AvroParquetOutputFormat.setSchema(job,
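Not an answer from the thread, but one thing worth trying is raising the log level for the Parquet packages (a sketch; note that older Parquet versions log through java.util.logging, so this may not silence everything):

    import org.apache.log4j.{Level, Logger}

    // raise the log level for Parquet's loggers (old and new package names)
    Logger.getLogger("parquet").setLevel(Level.ERROR)
    Logger.getLogger("org.apache.parquet").setLevel(Level.ERROR)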

Re: Re: OLAP query using spark dataframe with cassandra

2015-11-08 Thread fightf...@163.com
Hi, thanks for the suggestion. Actually we are now evaluating and stress-testing Spark SQL on Cassandra, while trying to define business models. FWIW, the solution mentioned here is different from a traditional OLAP cube engine, right? So we are hesitating on the common sense or direction choice

PySpark: cannot convert float infinity to integer, when setting batch in add_shuffle_key

2015-11-08 Thread trsell
Hello, I am running spark 1.5.1 on EMR using Python 3. I have a pyspark job which is doing some simple joins and reduceByKey operations. It works fine most of the time, but sometimes I get the following error: 15/11/09 03:00:53 WARN TaskSetManager: Lost task 2.0 in stage 4.0 (TID 69,

Re: Spark Job failing with exit status 15

2015-11-08 Thread Deng Ching-Mallete
Hi Shashi, It's possible that the logs you were seeing are the logs for the second attempt. By default, I think YARN is configured to re-attempt executing the job if it fails the first time. Try checking the application logs from the YARN RM UI; make sure that you click the first log attempt

Re: sqlCtx.sql('some_hive_table') works in pyspark but not spark-submit

2015-11-08 Thread Deng Ching-Mallete
Hi, Did you check if HADOOP_CONF_DIR is configured in your YARN application classpath? By default, the shell runs in local client mode, which is probably why it's resolving the env variable you're setting and was able to get the Hive metastore from your hive-site.xml. HTH, Deng On Sun, Nov 8,

Re: How to analyze weather data in Spark?

2015-11-08 Thread ayan guha
Hi, Is it possible to elaborate a little more? In order to consume a fixed-width file, the standard process should be: 1. Write a map function which takes the input as a string and implements the file spec to return a tuple of fields. 2. Load the files using sc.textFile (which reads the lines as strings). 3.
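A minimal sketch of steps 1 and 2 (the column offsets below are invented for illustration; the real offsets come from the file's format document):

    // parse one fixed-width line into (stationId, airTemperature)
    def parse(line: String): (String, Int) = {
      val station = line.substring(4, 10).trim   // hypothetical offsets
      val temp = line.substring(87, 92).trim.toInt
      (station, temp)
    }

    val records = sc.textFile("hdfs:///weather/1901/*").map(parse)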

Re: Is SPARK is the right choice for traditional OLAP query processing?

2015-11-08 Thread chandan prakash
Apache Drill is also a very good candidate for this. On Mon, Nov 9, 2015 at 9:33 AM, Hitoshi Ozawa wrote: > It depends on how much data needs to be processed. Data Warehouse with > indexes is going to be faster when there is not much data. If you have big > data, Spark

java.lang.ClassNotFoundException: org.apache.spark.streaming.twitter.TwitterReceiver

2015-11-08 Thread fanooos
This is my first Spark Streaming application. The setup is as follows: 3 nodes running a Spark cluster, one master node and two slaves. The application is a simple Java application streaming from Twitter, with dependencies managed by Maven. Here is the code of the application: public class

Wrap an RDD with a ShuffledRDD

2015-11-08 Thread Muhammad Haseeb Javed
I am working on a modified Spark core and have a Broadcast variable which I deserialize to obtain an RDD along with its set of dependencies, as is done in ShuffleMapTask, as follows: val taskBinary: Broadcast[Array[Byte]]; var (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
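Not from the thread, but for reference, a minimal sketch of wrapping a pair RDD in a ShuffledRDD directly (ShuffledRDD is a @DeveloperApi class, so this reflects the 1.x constructor and may change between releases):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.ShuffledRDD

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // key, value and combiner types; no map-side combine is configured here
    val shuffled = new ShuffledRDD[String, Int, Int](pairs, new HashPartitioner(4))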

Re: How to analyze weather data in Spark?

2015-11-08 Thread Hitoshi Ozawa
There's a document describing the format of files in the parent directory. It seems like a fixed width file. ftp://ftp.ncdc.noaa.gov/pub/data/noaa/ish-format-document.pdf

Re: java.lang.ClassNotFoundException: org.apache.spark.streaming.twitter.TwitterReceiver

2015-11-08 Thread Sean Owen
You included a very old version of the Twitter jar - 1.0.0. Did you mean 1.5.1? On Mon, Nov 9, 2015 at 7:36 AM, fanooos wrote: > This is my first Spark Stream application. The setup is as following > > 3 nodes running a spark cluster. One master node and two slaves. > >
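If it helps, the dependency would look like the following in sbt syntax (the Maven coordinates use the same groupId/artifactId; the version should match the Spark version on the cluster):

    // match the artifact version to the cluster's Spark version
    libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.5.1"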

parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-08 Thread swetha
Hi, I see an unwanted warning when I try to save a Parquet file to HDFS in Spark. Please find below the code and the warning message. Any idea as to how to avoid the unwanted warning message? activeSessionsToBeSaved.saveAsNewAPIHadoopFile("test", classOf[Void], classOf[ActiveSession],