Hi,
I confirm the models are exported for PMML version 4.2; in fact, you can see
it in the generated XML:
<PMML xmlns="http://www.dmg.org/PMML-4_2">
This is the default version when using
https://github.com/jpmml/jpmml-model/tree/1.1.X.
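For anyone following along, here is a minimal sketch of producing that XML
from MLlib, assuming Spark 1.4+ (where models such as KMeansModel mix in
PMMLExportable) and run in spark-shell so sc exists; the data and output
path are illustrative:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Train a trivial model and export it; the generated XML carries the
// PMML-4_2 namespace shown above.
val data = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(5.0, 6.0)))
val model = KMeans.train(data, 2, 10)
println(model.toPMML())               // PMML document as a String
model.toPMML("/tmp/kmeans-pmml.xml")  // or write to a local path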
I didn't realize the version attribute of the PMML root element
It is implemented with cogroup. Basically it stores states in a separate
RDD and cogroups the target RDD with the state RDD, which is then hidden
from you. See StateDStream.scala; it has everything you need to know.
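For context, the user-facing entry point to that mechanism is
updateStateByKey. A minimal running-word-count sketch, where pairs is an
assumed DStream[(String, Int)] and the checkpoint directory is illustrative
(stateful operations require one):

// The state RDD that gets cogrouped with each batch holds the running counts.
ssc.checkpoint("/tmp/checkpoint")
val updateFunc = (values: Seq[Int], state: Option[Int]) =>
  Some(values.sum + state.getOrElse(0))
val runningCounts = pairs.updateStateByKey[Int](updateFunc)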
On Fri, Nov 6, 2015 at 6:25 PM Hien Luu wrote:
> Hi,
>
> I
Andy,
Using rdd.saveAsTextFile(...) will overwrite the data if your target
is the same file.
If you want to save to HDFS, DStream offers dstream.saveAsTextFiles(prefix,
suffix) where a new file will be written at each streaming interval.
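A minimal sketch (names are illustrative); each batch interval produces
output named prefix-TIME_IN_MS.suffix:

// Writes one output directory per batch, e.g. /out/words-1446900000000.txt
wordCounts.saveAsTextFiles("/out/words", "txt")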
Note that this will result in a saved file for each
Which release of Spark were you using?
Can you post the command you used to run WordCount?
Cheers
On Sat, Nov 7, 2015 at 7:59 AM, Shashi Vishwakarma wrote:
> I am trying to run a simple word count job in Spark but I am getting an
> exception while running the job.
>
> For
Hi All,
Spark Version = 1.5.1
Hadoop Version = 2.6.0
I set up the cluster on Amazon EC2 machines (1+5).
I am able to create a SparkContext object using the *init* method from *RStudio*,
but I do not know how to create a SparkContext object in *yarn mode*.
I found the below link for running on YARN, but in
Hi
I am using Spark 1.3.0. The command that I use is below.
/spark-submit --class org.com.td.sparkdemo.spark.WordCount \
--master yarn-cluster \
target/spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar
Thanks
Shashi
On Sun, Nov 8, 2015 at 11:33 PM, Ted Yu wrote:
>
Hi All,
I am facing a weird situation, which is explained below.
Scenario and problem: I want to add two attributes to a JSON object based on the
lookup table values and insert the JSON into MongoDB. I have a broadcast variable
which holds the lookup table. However, I am not able to access it
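For comparison with what you have, the usual pattern is to create the
broadcast on the driver and read it through .value inside the
transformation. A minimal sketch, with an illustrative lookup map and an
assumed RDD[String] of keys:

// Driver side: broadcast the lookup table once.
val lookup = sc.broadcast(Map("k1" -> "v1", "k2" -> "v2"))
// Executor side: read it via .value inside the closure.
val enriched = keysRdd.map { key =>
  (key, lookup.value.getOrElse(key, "unknown"))
}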
I'm not getting your question about scheduling. Did you create a Spark
application and are asking how to schedule it to run? Are you going to output
results from the scheduled run in HDFS and join them in the first chain with
the real-time result?
I think it depends on the versions. Mixing something like 0.9.2 and 1.5.1
isn't recommended.
1.5.1 is a minor bug-fix release on top of 1.5.0, so I think most things will
work, but some features may behave differently; it's better to use the same revision.
Changes between versions/releases are listed in CHANGES.txt
Hi,
I am starting the Spark thrift server with the following script:
./start-thriftserver.sh --master yarn-client --driver-memory 1G
--executor-memory 2G --driver-cores 2 --executor-cores 2 --num-executors 4
--hiveconf hive.server2.thrift.port=10001 --hiveconf
Hi,
I thought I understood RDDs and DataFrames, but one noob thing is bugging
me (because I'm seeing weird errors involving joins):
*What does Spark do when you pass a big DataFrame as an argument to a
function?*
Are these DataFrames included in the closure of the function, and is
therefore
You can save the result to storage (e.g. Hive) and have a web application
read the data from there.
I think there's also a "toJSON" method to convert a Dataset to JSON.
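A minimal sketch of the toJSON route (in 1.5 it lives on DataFrame and
returns an RDD[String]; df is an assumed DataFrame):

// Each element is one row rendered as a JSON string.
val jsonLines = df.toJSON
jsonLines.take(5).foreach(println)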
Another option is to use something like Spark Kernel
(https://github.com/ibm-et/spark-kernel/wiki).
Another choice is to
Hi,
I am trying to cluster words of some articles. I used TFIDF and Word2Vec in
Spark to get the vector for each word and I used KMeans to cluster the
words. Now, is there any way to get back the words from the vectors? I want
to know what words are there in each cluster.
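For the Word2Vec side, one hedged approach: getVectors gives you the
word-to-vector map, so you can run each word's vector through the trained
KMeansModel and group the words by cluster id. A sketch, assuming
word2vecModel and kmeansModel are already trained:

import org.apache.spark.mllib.linalg.Vectors

// Map each vocabulary word to the cluster its vector falls into.
val wordClusters = word2vecModel.getVectors.toSeq.map { case (word, arr) =>
  (kmeansModel.predict(Vectors.dense(arr.map(_.toDouble))), word)
}
wordClusters.groupBy(_._1).foreach { case (id, words) =>
  println(s"cluster $id: ${words.map(_._2).mkString(", ")}")
}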
I am aware that TFIDF
Hi, community
We are especially interested in this feature integration, based on some
slides from [1]. The SMACK stack (Spark+Mesos+Akka+Cassandra+Kafka)
seems like a good implementation of the lambda architecture in the open-source
world, especially for non-Hadoop-based cluster environments. As we can see,
It depends on how much data needs to be processed. A data warehouse with
indexes is going to be faster when there is not much data. If you have big
data, Spark Streaming and maybe Spark SQL may interest you.
Are you sure you downloaded the pre-built version? The default is the source
package.
Please check that the name of the file you've downloaded starts with
"spark-1.5.1-bin-" (note the "bin").
Hi Vincenzo/Owen,
I have sent a pull request [1] with the necessary changes to add the PMML
version attribute to the root node. I have also linked the issue under the
PMML improvement umbrella [2] as you suggested.
[1] https://github.com/apache/spark/pull/9558
[2]
Is there any distributor supporting these software components in combination?
If not, and your core business is not software, then you may want to look for
something else, because it might not make sense to build up internal know-how
in all of these areas.
In any case, it all depends highly on
Hi,
I see a lot of unwanted SysOuts when I try to save an RDD as a Parquet file.
Below are the code and the SysOuts. Any idea how to avoid them?
ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
AvroParquetOutputFormat.setSchema(job,
Hi,
Thanks for the suggestion. We are now evaluating and stress-testing Spark
SQL on Cassandra while trying to define business models.
FWIW, the solution mentioned here is different from a traditional OLAP cube
engine, right? So we are hesitating over the right direction to choose
Hello,
I am running spark 1.5.1 on EMR using Python 3.
I have a pyspark job which is doing some simple joins and reduceByKey
operations. It works fine most of the time, but sometimes I get the
following error:
15/11/09 03:00:53 WARN TaskSetManager: Lost task 2.0 in stage 4.0 (TID
69,
Hi Shashi,
It's possible that the logs you were seeing are the logs for the second
attempt. By default, I think YARN is configured to re-attempt the job if it
fails the first time. Try checking the application logs from the YARN RM UI,
and make sure that you click the first attempt's log
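If it helps, once log aggregation is enabled you can also pull the logs for
all attempts from the command line with yarn logs -applicationId
<application_id> and compare the attempts there.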
Hi,
Did you check whether HADOOP_CONF_DIR is configured in your YARN application
classpath? By default, the shell runs in local client mode, which is
probably why it's resolving the env variable you're setting and was able to
get the Hive metastore from your hive-site.xml.
HTH,
Deng
On Sun, Nov 8,
Hi
Is it possible to elaborate a little more?
To consume a fixed-width file, the standard process would be (see the sketch
after this list):
1. Write a map function which takes a line as a string and applies the file
spec to return a tuple of fields.
2. Load the files using sc.textFile (which reads the lines as strings).
3.
Apache Drill is also a very good candidate for this.
On Mon, Nov 9, 2015 at 9:33 AM, Hitoshi Ozawa wrote:
> It depends on how much data needs to be processed. A data warehouse with
> indexes is going to be faster when there is not much data. If you have big
> data, Spark
This is my first Spark Streaming application. The setup is as follows:
3 nodes running a Spark cluster: one master node and two slaves.
The application is a simple Java application streaming from Twitter, with
dependencies managed by Maven.
Here is the code of the application
public class
I am working on a modified Spark core and have a Broadcast variable which I
deserialize to obtain an RDD along with its set of dependencies, as is done
in ShuffleMapTask:
val taskBinary: Broadcast[Array[Byte]]
var (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
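For reference, the corresponding call in ShuffleMapTask in Spark 1.5 looks
roughly like this (a sketch from memory rather than the exact source; inside
Spark core, SparkEnv, RDD, ShuffleDependency and Broadcast are in scope):

import java.nio.ByteBuffer

// Deserialize the (RDD, ShuffleDependency) pair that was broadcast as bytes.
val ser = SparkEnv.get.closureSerializer.newInstance()
val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
  ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)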
There's a document describing the format of the files in the parent directory.
It seems like a fixed-width file.
ftp://ftp.ncdc.noaa.gov/pub/data/noaa/ish-format-document.pdf
You included a very old version of the Twitter jar - 1.0.0. Did you mean 1.5.1?
On Mon, Nov 9, 2015 at 7:36 AM, fanooos wrote:
> This is my first Spark Streaming application. The setup is as follows:
>
> 3 nodes running a Spark cluster: one master node and two slaves.
>
>
Hi,
I see an unwanted Warning when I try to save a Parquet file to HDFS in Spark.
Please find below the code and the Warning message. Any idea how to avoid
it?
activeSessionsToBeSaved.saveAsNewAPIHadoopFile("test", classOf[Void],
classOf[ActiveSession],