Re: Using spark streaming to load data from Kafka to HDFS

2015-08-22 Thread Xu (Simon) Chen
Last time I checked, Camus doesn't support storing data as Parquet, which is a deal breaker for me. Otherwise it works well for my Kafka topics with low data volume. I am currently using Spark Streaming to ingest data, generate semi-realtime stats and publish to a dashboard, and dump the full dataset
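
A minimal sketch of this kind of pipeline, assuming Spark 1.x with the spark-streaming-kafka artifact on the classpath (the topic, group, ZooKeeper quorum, and output path below are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("kafka-ingest")
    val ssc = new StreamingContext(conf, Seconds(60))

    // Receiver-based stream: (ZooKeeper quorum, consumer group, topic -> partitions).
    val lines = KafkaUtils.createStream(ssc, "zk1:2181", "ingest-group",
      Map("events" -> 4)).map(_._2)

    // Semi-realtime stats for a dashboard...
    lines.count().print()

    // ...and a dump of the full dataset to HDFS, one directory per batch.
    lines.foreachRDD { (rdd, time) =>
      rdd.saveAsTextFile(s"hdfs://test/ingest/batch-${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()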

Re: access hdfs file name in map()

2014-08-01 Thread Xu (Simon) Chen
Hi Roberto, Ultimately, the info you need is set here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L69 Being a Spark newbie, I extended the org.apache.spark.rdd.HadoopRDD class as HadoopRDDWithEnv, which takes in an additional parameter
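
For newer Spark releases, a sketch of the same idea without a custom subclass, using HadoopRDD.mapPartitionsWithInputSplit (a @DeveloperApi; the path and record types below are illustrative):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
    import org.apache.spark.rdd.HadoopRDD

    // sc.hadoopFile returns a HadoopRDD at runtime, hence the cast.
    val rdd = sc.hadoopFile("hdfs://test/path/*",
        classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
      .asInstanceOf[HadoopRDD[LongWritable, Text]]

    // Tag every line with the name of the file it came from.
    val withFileName = rdd.mapPartitionsWithInputSplit { (split, iter) =>
      val file = split.asInstanceOf[FileSplit].getPath.toString
      iter.map { case (_, line) => (file, line.toString) }
    }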

Task progress in ipython?

2014-06-26 Thread Xu (Simon) Chen
I am pretty happy with using pyspark with an IPython notebook. The only issue is that I need to look at the console output or the Spark UI to track task progress. I wonder if anyone has thought of, or better yet written, something to display progress bars on the same page when I evaluate a cell in ipynb? I

performance difference between spark-shell and spark-submit

2014-06-09 Thread Xu (Simon) Chen
Hi all, I implemented a transformation on HDFS files with Spark. I first tested it in spark-shell (with YARN), then implemented essentially the same logic as a Spark program (Scala), built a jar file, and used spark-submit to execute it on my YARN cluster. The weird thing is that the spark-submit approach

Re: cache spark sql parquet file in memory?

2014-06-07 Thread Xu (Simon) Chen
...@databricks.com: Not a stupid question! I would like to be able to do this. For now, you might try writing the data to Tachyon (http://tachyon-project.org/) instead of HDFS. This is untested, though; please report any issues you run into. Michael On Fri, Jun 6, 2014 at 8:13 PM, Xu (Simon) Chen

cache spark sql parquet file in memory?

2014-06-06 Thread Xu (Simon) Chen
This might be a stupid question... but it seems that saveAsParquetFile() writes everything back to HDFS. I am wondering if it is possible to cache parquet-format intermediate results in memory, thereby making Spark SQL queries faster. Thanks. -Simon
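
The reply above suggests Tachyon; within Spark SQL itself, a minimal sketch of keeping the intermediate result in the in-memory columnar cache (1.x-era API; the table name and path are illustrative):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val data = sqlContext.parquetFile("hdfs://test/intermediate.parquet")
    data.registerTempTable("intermediate")  // registerAsTable() in Spark 1.0
    sqlContext.cacheTable("intermediate")   // columnar in-memory cache

    // Subsequent queries hit the in-memory store instead of HDFS.
    sqlContext.sql("SELECT count(*) FROM intermediate").collect()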

spark worker and yarn memory

2014-06-05 Thread Xu (Simon) Chen
I am slightly confused about the --executor-memory setting. My YARN cluster has a maximum container memory of 8192MB. When I specify --executor-memory 8G in my spark-shell, no container can be started at all. It only works when I lower the executor memory to 7G. But then, on YARN, I see 2
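
The likely explanation, though it isn't spelled out in this excerpt: on YARN, Spark requests containers sized at --executor-memory plus a memory overhead (roughly 384 MB by default in the 1.0 era), so 8G plus overhead exceeds the 8192 MB container cap while 7G fits:

    # 7g + ~384 MB overhead stays under the 8192 MB YARN container limit
    ./bin/spark-shell --master yarn-client --executor-memory 7g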

compress in-memory cache?

2014-06-05 Thread Xu (Simon) Chen
I have a working set larger than available memory, so I am hoping to turn on RDD compression to store more in memory. Strangely, it made no difference: the number of cached partitions, fraction cached, and size in memory all remain the same. Any ideas? I confirmed that RDD compression

Re: compress in-memory cache?

2014-06-05 Thread Xu (Simon) Chen
persistence level is MEMORY_ONLY, so that setting will have no impact. On Thu, Jun 5, 2014 at 4:41 PM, Xu (Simon) Chen xche...@gmail.com wrote: I have a working set larger than available memory, so I am hoping to turn on RDD compression to store more in memory. Strangely, it made
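
The point of the reply above: spark.rdd.compress only applies to serialized partitions, so the cache must use a serialized storage level for the setting to matter. A minimal sketch:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf().setAppName("compressed-cache")
      .set("spark.rdd.compress", "true")
    val sc = new SparkContext(conf)

    // MEMORY_ONLY caches deserialized Java objects, so compression never applies;
    // MEMORY_ONLY_SER caches serialized bytes, which spark.rdd.compress can shrink.
    val rdd = sc.textFile("hdfs://test/path/*").persist(StorageLevel.MEMORY_ONLY_SER)
    rdd.count()  // materialize the cache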

Re: spark worker and yarn memory

2014-06-05 Thread Xu (Simon) Chen
, Xu (Simon) Chen xche...@gmail.com wrote: I am slightly confused about the --executor-memory setting. My yarn cluster has a maximum container memory of 8192MB. When I specify --executor-memory 8G in my spark-shell, no container can be started at all. It only works when I lower the executor

Re: Join : Giving incorrect result

2014-06-04 Thread Xu (Simon) Chen
Maybe your two workers have different assembly jar files? I just ran into a similar problem where my spark-shell was using a different jar file than my workers and got really confusing results. On Jun 4, 2014 8:33 AM, Ajay Srivastava a_k_srivast...@yahoo.com wrote: Hi, I am doing a join of two RDDs

Re: access hdfs file name in map()

2014-06-04 Thread Xu (Simon) Chen
N/M... I wrote a HadoopRDD subclass that appends an env field from the HadoopPartition to the value in the compute() function. It worked pretty well. Thanks! On Jun 4, 2014 12:22 AM, Xu (Simon) Chen xche...@gmail.com wrote: I don't quite get it... mapPartitionsWithIndex takes a function that maps

Re: access hdfs file name in map()

2014-06-03 Thread Xu (Simon) Chen
/SparkContext.scala#L456 On Thu, May 29, 2014 at 7:49 PM, Xu (Simon) Chen xche...@gmail.com wrote: Hello, A quick question about using Spark to parse text-format CSV files stored on HDFS. I have something very simple: sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p

Re: spark 1.0.0 on yarn

2014-06-02 Thread Xu (Simon) Chen
OK, rebuilding the assembly jar file with cdh5 works now... Thanks.. -Simon On Sun, Jun 1, 2014 at 9:37 PM, Xu (Simon) Chen xche...@gmail.com wrote: That helped a bit... Now I have a different failure: the start up process is stuck in an infinite loop outputting the following message: 14
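
For reference, a build invocation along these lines (the exact CDH version string is an assumption; match it to the cluster):

    # Rebuild the assembly against a CDH5 Hadoop, e.g.:
    mvn -Pyarn -Dhadoop.version=2.3.0-cdh5.0.0 -DskipTests clean package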

pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
Hi folks, I have a weird problem when using pyspark with yarn. I started ipython as follows: IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G When I create a notebook, I can see workers being created, and indeed I see the Spark UI running on my

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
(or not). Cheers, Andrew 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen xche...@gmail.com: Hi folks, I have a weird problem when using pyspark with yarn. I started ipython as follows: IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G When I

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
-DskipTests clean package Anything that I might have missed? Thanks. -Simon On Mon, Jun 2, 2014 at 12:02 PM, Xu (Simon) Chen xche...@gmail.com wrote: 1) yes, that sc.parallelize(range(10)).count() has the same error. 2) the files seem to be correct 3) I have trouble at this step

Re: spark 1.0.0 on yarn

2014-06-02 Thread Xu (Simon) Chen
which CDH-5 version are you building against? On Mon, Jun 2, 2014 at 8:11 AM, Xu (Simon) Chen xche...@gmail.com wrote: OK, rebuilding the assembly jar file with cdh5 works now... Thanks.. -Simon On Sun, Jun 1, 2014 at 9:37 PM, Xu (Simon) Chen xche...@gmail.com wrote: That helped

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
OK, my colleague found this: https://mail.python.org/pipermail/python-list/2014-May/671353.html And my jar file has 70011 files, past the 65536-entry limit of the classic zip format. Fantastic... On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen xche...@gmail.com wrote: I asked several people, no one seems to believe that we can do

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
if you use JDK 6 to compile? On Mon, Jun 2, 2014 at 12:03 PM, Xu (Simon) Chen xche...@gmail.com wrote: OK, my colleague found this: https://mail.python.org/pipermail/python-list/2014-May/671353.html And my jar file has 70011 files. Fantastic.. On Mon, Jun 2, 2014 at 2:34 PM, Xu

Re: spark 1.0.0 on yarn

2014-06-01 Thread Xu (Simon) Chen
with SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being set up correctly. - Patrick On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen xche...@gmail.com wrote: Hi all, I tried a couple of ways, but couldn't get it to work... The following seems to be what the online document (http

Re: spark 1.0.0 on yarn

2014-06-01 Thread Xu (Simon) Chen
instead of using two named resource managers? I wonder if somehow the YARN client can't detect this multi-master set-up. On Sun, Jun 1, 2014 at 12:49 PM, Xu (Simon) Chen xche...@gmail.com wrote: Note that everything works fine in spark 0.9, which is packaged in CDH5: I can launch a spark

spark 1.0.0 on yarn

2014-05-31 Thread Xu (Simon) Chen
Hi all, I tried a couple of ways, but couldn't get it to work... The following seems to be what the online document (http://spark.apache.org/docs/latest/running-on-yarn.html) is suggesting: SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
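
A sketch of the launch the linked document describes (the assembly path is the one quoted above; the rest is an assumption):

    export SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
    ./bin/spark-shell --master yarn-client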

access hdfs file name in map()

2014-05-29 Thread Xu (Simon) Chen
Hello, A quick question about using Spark to parse text-format CSV files stored on HDFS. I have something very simple: sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p => ("XXX", p(0), p(2))) Here, I want to replace "XXX" with a string: the current CSV filename for the line.
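
One way to get at the filename without touching HadoopRDD internals, assuming each file is small enough to be read whole (sc.wholeTextFiles ships with Spark 1.0):

    // Each record is (fileName, fileContent); re-split into lines, then fields.
    val rows = sc.wholeTextFiles("hdfs://test/path/*").flatMap {
      case (fileName, content) =>
        content.split("\n").map(_.split(",")).map(p => (fileName, p(0), p(2)))
    }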