My guess is that Jia wants to run C++ on top of Spark. If that's the case, I'm
afraid this is not possible. Spark has support for Java, Python, Scala and R.
The best way to achieve this is to run your application in C++ and use the
data created by said application to do manipulation within Spark.
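A minimal sketch of that pattern, assuming the C++ application writes its
output as CSV files (the path below is just a placeholder):

from pyspark import SparkContext

sc = SparkContext(appName="cpp-output-reader")

# "/data/cpp_output/*.csv" is a placeholder for wherever the C++
# application writes its results.
lines = sc.textFile("/data/cpp_output/*.csv")
rows = lines.map(lambda line: line.split(","))
print(rows.take(5))
sc.stop()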
I guess you could write a custom RDD that reads data from a memory-mapped
file (rough sketch below) - not really my area of expertise, so I'll leave it
to other members of the forum to chip in with comments as to whether that
makes sense.
But if you want 'fancy analytics', then won't the processing time more than
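Here's the kind of thing I mean - a rough PySpark sketch, assuming a
fixed-width binary file written by the C++ process and visible at the same
path on every worker node (all names and sizes below are made up):

import mmap
from pyspark import SparkContext

sc = SparkContext(appName="mmap-sketch")

PATH = "/dev/shm/shared_records.bin"   # hypothetical shared-memory file
RECORD_SIZE = 64                       # hypothetical fixed record width
NUM_RECORDS = 1000000

def read_records(indices):
    # Each partition memory-maps the file and yields its slice of records.
    with open(PATH, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for i in indices:
            yield bytes(mm[i * RECORD_SIZE:(i + 1) * RECORD_SIZE])

records = sc.parallelize(range(NUM_RECORDS), 8).mapPartitions(read_records)
print(records.count())
sc.stop()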
Hi,
We are trying to change our existing Oozie workflows to use SparkAction
instead of ShellAction.
We are passing Spark configuration in spark-opts with --conf, but these
values are not accessible in Spark and it throws an error.
Please note we are able to use SparkAction successfully in
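One way to debug this is to dump what the driver actually received; a
minimal PySpark sketch (the custom key below is just an example):

from pyspark import SparkContext

sc = SparkContext(appName="conf-check")

# Print every property the driver received; compare against the
# --conf values set in the Oozie <spark-opts> element.
for key, value in sc.getConf().getAll():
    print(key, "=", value)

# "spark.myapp.env" is a hypothetical key passed as
# --conf spark.myapp.env=prod in spark-opts.
print(sc.getConf().get("spark.myapp.env", "NOT SET"))
sc.stop()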
Jia,
I'm confused by this. The architecture of Spark is to run on top of HDFS.
What you're requesting - reading from and writing to a C++ process - is not
part of that architecture.
On Monday, December 7, 2015 1:42 PM, Jia wrote:
Thanks, Annabel, but I may need
This is a very helpful article. Thanks for the help.
Ningjun
From: Sujit Pal [mailto:sujitatgt...@gmail.com]
Sent: Monday, December 07, 2015 12:42 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: How to create dataframe from SQL Server SQL query
Hi Ningjun,
Haven't done
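For anyone finding this thread later, a hedged sketch of the thread's
subject - reading a SQL Server query into a DataFrame over JDBC. The URL,
credentials, and query below are placeholders, and the Microsoft JDBC driver
jar must be on the classpath:

# sqlContext is the usual SQLContext entry point (Spark 1.x).
df = sqlContext.read.format("jdbc").options(
    url="jdbc:sqlserver://myhost:1433;databaseName=mydb",
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
    dbtable="(SELECT id, title FROM documents) AS q",
    user="myuser",
    password="mypassword",
).load()
df.show(5)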
Annabel,
Spark works very well with data stored in HDFS but is certainly not tied to it.
Have a look at the wide variety of connectors to things like Cassandra, HBase,
etc.
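For example, with the DataStax spark-cassandra-connector on the classpath
(the keyspace and table names below are placeholders):

df = sqlContext.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(keyspace="my_keyspace", table="my_table") \
    .load()
df.show(5)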
Robin
Sent from my iPhone
> On 7 Dec 2015, at 18:50, Annabel Melongo wrote:
>
> Jia,
Hi,
How do I do a Maven build that enables monitoring with Ganglia? What is the
command for that?
Thanks,
Swetha
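In case it helps while waiting for replies: the Ganglia sink is
LGPL-licensed, so it sits behind an optional Maven profile. Something along
these lines should work (double-check the profile name against the
monitoring docs for your Spark version):

./build/mvn -Pspark-ganglia-lgpl -DskipTests clean package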
I have a PySpark app loading a large-ish (100GB) dataframe from JSON files,
and it turns out there are a number of duplicate JSON objects in the source
data. I am
trying to find the best way to remove these duplicates before using the
dataframe.
With both df.dropDuplicates() and
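(For context, a minimal sketch of the dropDuplicates variants in question;
the path and the "id" column are placeholders:)

df = sqlContext.read.json("hdfs:///path/to/json/")

deduped_all = df.dropDuplicates()        # rows must match on every column
deduped_key = df.dropDuplicates(["id"])  # keep one row per "id" value
print(df.count(), deduped_all.count(), deduped_key.count())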
I'm not a Python expert, so I'm wondering if anybody has a working
example of a partitioner for the "partitionFunc" argument (default
"portable_hash") to rdd.partitionBy()?