Last time I checked, Camus doesn't support storing data as parquet, which
is a deal breaker for me. Otherwise it works well for my Kafka topics with
low data volume.
I am currently using spark streaming to ingest data, generate semi-realtime
stats and publish them to a dashboard, and dump the full dataset
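(For context, a minimal sketch of that kind of pipeline, assuming the
spark-streaming-kafka artifact is on the classpath and with hypothetical
zookeeper, group, topic, and path names:)

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("hdfs://test/checkpoint")  // countByWindow needs a checkpoint dir

    val lines = KafkaUtils.createStream(ssc, "zk:2181", "group", Map("topic" -> 1)).map(_._2)
    lines.countByWindow(Seconds(60), Seconds(10)).print()  // semi-realtime stats
    lines.saveAsTextFiles("hdfs://test/dump/stream")       // full-dataset dump
    ssc.start()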
Hi Roberto,
Ultimately, the info you need is set here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L69
Being a spark newbie, I extended the org.apache.spark.rdd.HadoopRDD class as
HadoopRDDWithEnv, which takes in an additional parameter
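(The subclass itself isn't in the archive. For anyone reading along, a sketch
of the same effect without subclassing, via HadoopRDD.mapPartitionsWithInputSplit,
a DeveloperApi in later Spark releases; the path is hypothetical:)

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
    import org.apache.spark.rdd.HadoopRDD

    val hrdd = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs://test/path/*")
      .asInstanceOf[HadoopRDD[LongWritable, Text]]

    // tag each line with the file its input split came from; Text objects are
    // reused by the record reader, so copy them out with toString
    val withFile = hrdd.mapPartitionsWithInputSplit { (split: InputSplit, iter: Iterator[(LongWritable, Text)]) =>
      val file = split.asInstanceOf[FileSplit].getPath.toString
      iter.map { case (_, line) => (file, line.toString) }
    }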
I am pretty happy with using pyspark with ipython notebook. The only issue
is that I need to look at the console output or the spark UI to track task
progress. I wonder if anyone has thought of, or better yet written, something
to display progress bars on the same page when I evaluate a cell in ipynb?
I
Hi all,
I implemented a transformation on hdfs files with spark. I first tested it in
spark-shell (with yarn), then implemented essentially the same logic as a
spark program (scala), built a jar file, and used spark-submit to execute it
on my yarn cluster. The weird thing is that the spark-submit approach
...@databricks.com:
Not a stupid question! I would like to be able to do this. For now, you
might try writing the data to tachyon http://tachyon-project.org/
instead of HDFS. This is untested though; please report any issues you run
into.
Michael
On Fri, Jun 6, 2014 at 8:13 PM, Xu (Simon) Chen
This might be a stupid question... but it seems that saveAsParquetFile()
writes everything back to HDFS. I am wondering if it is possible to cache
parquet-format intermediate results in memory, thereby making spark
sql queries faster.
Thanks.
-Simon
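(One in-memory alternative that did exist in Spark SQL at the time is the
columnar table cache; a rough sketch, with hypothetical path and table name:)

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val data = sqlContext.parquetFile("hdfs://test/path/data.parquet")
    data.registerAsTable("events")     // renamed registerTempTable in later releases
    sqlContext.cacheTable("events")    // cached in columnar form on first use
    sqlContext.sql("SELECT count(*) FROM events").collect()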
I am slightly confused about the --executor-memory setting. My yarn
cluster has a maximum container memory of 8192MB.
When I specify --executor-memory 8G in my spark-shell, no container can
be started at all. It only works when I lower the executor memory to 7G.
But then, on yarn, I see 2
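(The usual explanation: on yarn, the container request is executor memory plus
a per-executor overhead, roughly 384MB in early Spark-on-YARN releases and
later configurable via spark.yarn.executor.memoryOverhead. 8G plus overhead
exceeds the 8192MB cap, while 7G fits:)

    # 7G + ~384MB overhead fits under the 8192MB container limit; 8G + overhead does not
    ./bin/spark-shell --master yarn-client --executor-memory 7G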
I have a working set larger than available memory, so I am hoping to turn
on rdd compression so that I can store more in memory. Strangely, it made no
difference: the number of cached partitions, fraction cached, and size in
memory remain the same. Any ideas?
I confirmed that rdd compression only applies to serialized data. Your
persistence level is MEMORY_ONLY, so
that setting will have no impact.
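(A sketch of the corresponding fix: spark.rdd.compress only touches partitions
stored in serialized form, so persist with MEMORY_ONLY_SER; the master, app
name, and path are placeholders:)

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // compression only kicks in for serialized storage levels
    val conf = new SparkConf()
      .setMaster("local[*]").setAppName("rdd-compress-test")
      .set("spark.rdd.compress", "true")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("hdfs://test/path/*").persist(StorageLevel.MEMORY_ONLY_SER)
    rdd.count()  // materialize, then compare "Size in Memory" on the storage tab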
On Thu, Jun 5, 2014 at 4:41 PM, Xu (Simon) Chen xche...@gmail.com wrote:
I have a working set larger than available memory, so I am hoping to
turn on rdd compression so that I can store more in memory. Strangely it
made
, Xu (Simon) Chen xche...@gmail.com wrote:
I am slightly confused about the --executor-memory setting. My yarn
cluster has a maximum container memory of 8192MB.
When I specify --executor-memory 8G in my spark-shell, no container
can be started at all. It only works when I lower the executor
Maybe your two workers have different assembly jar files?
I just ran into a similar problem where my spark-shell was using a different
jar file than my workers - I got really confusing results.
On Jun 4, 2014 8:33 AM, Ajay Srivastava a_k_srivast...@yahoo.com wrote:
Hi,
I am doing join of two RDDs
N/M... I wrote a HadoopRDD subclass and appended an env field of the
HadoopPartition to the value in the compute function. It worked pretty well.
Thanks!
On Jun 4, 2014 12:22 AM, Xu (Simon) Chen xche...@gmail.com wrote:
I don't quite get it...
mapPartitionsWithIndex takes a function that maps
/SparkContext.scala#L456
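(For reference, the shape of that API; it hands your function the partition
index, not a file name:)

    // every line gets tagged with the index of the partition holding it
    val tagged = sc.textFile("hdfs://test/path/*").mapPartitionsWithIndex {
      (idx, iter) => iter.map(line => (idx, line))
    }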
On Thu, May 29, 2014 at 7:49 PM, Xu (Simon) Chen xche...@gmail.com
wrote:
Hello,
A quick question about using spark to parse text-format CSV files stored
on hdfs.
I have something very simple:
sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p
OK, rebuilding the assembly jar file with cdh5 works now...
Thanks..
-Simon
On Sun, Jun 1, 2014 at 9:37 PM, Xu (Simon) Chen xche...@gmail.com wrote:
That helped a bit... Now I have a different failure: the startup process
is stuck in an infinite loop, outputting the following message:
14
Hi folks,
I have a weird problem when using pyspark with yarn. I started ipython as
follows:
IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors
4 --executor-memory 4G
When I create a notebook, I can see workers being created and indeed I see
spark UI running on my
(or not).
Cheers,
Andrew
2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen xche...@gmail.com:
Hi folks,
I have a weird problem when using pyspark with yarn. I started ipython as
follows:
IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
--num-executors 4 --executor-memory 4G
When I
-DskipTests clean
package
Anything that I might have missed?
Thanks.
-Simon
On Mon, Jun 2, 2014 at 12:02 PM, Xu (Simon) Chen xche...@gmail.com wrote:
1) yes, sc.parallelize(range(10)).count() gives the same error.
2) the files seem to be correct
3) I have trouble at this step
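(The truncated build line above presumably came from the Spark docs; the full
CDH5 invocation of that era looked roughly like the following, with the exact
hadoop.version an assumption:)

    mvn -Pyarn -Dhadoop.version=2.3.0-cdh5.0.0 -DskipTests clean package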
which CDH-5 version are you building against?
On Mon, Jun 2, 2014 at 8:11 AM, Xu (Simon) Chen xche...@gmail.com wrote:
OK, rebuilding the assembly jar file with cdh5 works now...
Thanks..
-Simon
On Sun, Jun 1, 2014 at 9:37 PM, Xu (Simon) Chen xche...@gmail.com
wrote:
That helped
OK, my colleague found this:
https://mail.python.org/pipermail/python-list/2014-May/671353.html
And my jar file has 70011 files. Fantastic..
On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen xche...@gmail.com wrote:
I asked several people; no one seems to believe that we can do
if you use JDK 6 to compile?
On Mon, Jun 2, 2014 at 12:03 PM, Xu (Simon) Chen xche...@gmail.com
wrote:
OK, my colleague found this:
https://mail.python.org/pipermail/python-list/2014-May/671353.html
And my jar file has 70011 files. Fantastic..
On Mon, Jun 2, 2014 at 2:34 PM, Xu
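(The linked python-list thread is about zipimport rejecting archives with
65536 or more entries; jars built by JDK 7 spill into zip64 format, which the
Python side of pyspark cannot read. A quick check, assuming the stock assembly
name:)

    # the last line of the listing shows the total entry count
    unzip -l spark-assembly-1.0.0-hadoop2.2.0.jar | tail -n 1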
with
SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being
set up correctly.
- Patrick
On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen xche...@gmail.com
wrote:
Hi all,
I tried a couple of ways, but couldn't get it to work...
The following seems to be what the online document
(http
instead of using two named
resource managers? I wonder if somehow the YARN client can't detect
this multi-master set-up.
On Sun, Jun 1, 2014 at 12:49 PM, Xu (Simon) Chen xche...@gmail.com
wrote:
Note that everything works fine in spark 0.9, which is packaged in CDH5:
I
can launch a spark
Hi all,
I tried a couple of ways, but couldn't get it to work...
The following seems to be what the online document (
http://spark.apache.org/docs/latest/running-on-yarn.html) is suggesting:
SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
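(presumably followed by a launch along these lines; the exact command is an
assumption:)

    SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar \
      ./bin/spark-shell --master yarn-client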
Hello,
A quick question about using spark to parse text-format CSV files stored on
hdfs.
I have something very simple:
sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p =>
(XXX, p[0], p[2]))
Here, I want to replace XXX with a string: the name of the csv file that the
current line came from.