Re: spark 1.6.0 read s3 files error.

2016-08-02 Thread freedafeng
Solution:

sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", "...")
sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", "...")

Got this solution from Neerja at Cloudera. Thanks!
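For reference, a minimal PySpark sketch of how the fix above slots into a job. The bucket, key path, and credential strings are placeholders, and the property names are taken verbatim from the reply above; the exact keys (and whether an s3a:// or s3n:// URL is used) can vary with the Hadoop/CDH version.

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('s3-read-sketch')
sc = SparkContext(conf=conf)

# Set the S3 credentials on the underlying Hadoop configuration before
# reading any files (placeholder values shown).
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.awsAccessKeyId", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3a.awsSecretAccessKey", "YOUR_SECRET_KEY")

# Hypothetical bucket and key path.
rdd = sc.textFile("s3a://your-bucket/y=2016/m=5/d=26/h=20/data.json.gz")
print(rdd.count())

sc.stop()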

Re: spark 1.6.0 read s3 files error.

2016-08-02 Thread freedafeng
Anyone, please? I believe many of us are using Spark 1.6 or higher with S3...

Re: spark 1.6.0 read s3 files error.

2016-07-28 Thread freedafeng
Tried the following; it still failed the same way. It ran on YARN, CDH 5.8.0.

from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('s3 ---')
sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "...")

Re: spark 1.6.0 read s3 files error.

2016-07-28 Thread freedafeng
BTW, I also tried YARN. Same error. When I ran the script, I used the real credentials for S3, which are omitted in this post. Sorry about that.

Re: spark 1.6.0 read s3 files error.

2016-07-28 Thread freedafeng
The question is: what is the cause of the problem, and how can it be fixed? Thanks.

spark 1.6.0 read s3 files error.

2016-07-27 Thread freedafeng
CDH 5.7.1, pyspark. Code:

from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('s3 ---')
sc = SparkContext(conf=conf)
myRdd = sc.textFile("s3n:///y=2016/m=5/d=26/h=20/2016.5.26.21.9.52.6d53180a-28b9-4e65-b749-b4a2694b9199.json.gz")
count =

Re: build spark 1.6 against cdh5.7 with hadoop 2.6.0 hbase 1.2: Failure

2016-04-12 Thread freedafeng
Ugh, a typo: I was supposed to use cdh5.7.0. I reran the command with the fix, but still get the same error.

build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.0 -DskipTests clean package

build spark 1.6 against cdh5.7 with hadoop 2.6.0 hbase 1.2: Failure

2016-04-12 Thread freedafeng
JDK: 1.8.0_77, Scala: 2.10.4, mvn: 3.3.9. Slightly changed the pom.xml:

$ diff pom.xml pom.original
130c130
< 2.6.0-cdh5.7.0-SNAPSHOT
---
> 2.2.0
133c133
< 1.2.0-cdh5.7.0-SNAPSHOT
---
> 0.98.7-hadoop2

Command: build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.6.0

spark's behavior about failed tasks

2015-08-12 Thread freedafeng
Hello there, I have a Spark job running in a 20-node cluster. The job is logically simple: just a mapPartitions and then a sum. The return value of the mapPartitions is an integer for each partition. The tasks got some random failures (which could be caused by a 3rd-party key-value store's connections). The
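A minimal sketch of the job shape described above, with a hypothetical count_partition helper standing in for the real per-partition work against the third-party key-value store:

from pyspark import SparkContext

sc = SparkContext(appName="mapPartitions-then-sum")

def count_partition(records):
    # Stand-in for the real work: e.g. talk to an external key-value store,
    # then yield a single integer for this partition.
    n = 0
    for _ in records:
        n += 1
    yield n

data = sc.parallelize(range(100000), 20)
total = data.mapPartitions(count_partition).sum()
print(total)
sc.stop()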

Lots of fetch failures on saveAsNewAPIHadoopDataset PairRDDFunctions

2015-03-13 Thread freedafeng
Spark 1.1.1 + HBase (CDH 5.3.1). 20 nodes, each with 4 cores and 32 GB of memory; 3 cores and 16 GB were assigned to Spark on each worker node. Standalone mode. The data set is 3.8 TB. Wondering how to fix this. Thanks!

correct way to broadcast a variable

2015-02-12 Thread freedafeng
Suppose I have an object to broadcast and then use in a mapper function, something like the following (Python code):

obj2share = sc.broadcast(<some object here>)
someRdd.map(createMapper(obj2share)).collect()

The createMapper function will create a mapper function using the shared object's value. Another
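A small runnable sketch of that pattern, assuming a hypothetical lookup dict as the shared object:

from pyspark import SparkContext

sc = SparkContext(appName="broadcast-example")

# Hypothetical shared object: a lookup table.
obj2share = sc.broadcast({"a": 1, "b": 2, "c": 3})

def createMapper(shared):
    # Return a mapper closed over the broadcast handle; each worker reads
    # the actual object through shared.value.
    def mapper(x):
        return shared.value.get(x, 0)
    return mapper

result = sc.parallelize(["a", "b", "x"]).map(createMapper(obj2share)).collect()
print(result)  # [1, 2, 0]
sc.stop()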

What could cause number of tasks to go down from 2k to 1?

2015-01-29 Thread freedafeng
Hi, the input data has 2048 partitions. The final step is to load the processed data into HBase through saveAsNewAPIHadoopDataset(). Every step except the last one ran in parallel in the cluster, but the last step only has 1 task, which runs on only 1 node using one core. Spark 1.1.1 +
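For context, a sketch of what such a save step typically looks like in PySpark, loosely following the hbase_outputformat.py example that ships with Spark; the ZooKeeper host, table name, and sample records are placeholders, and the converter classes require the spark-examples jar on the classpath (e.g. via --driver-class-path).

from pyspark import SparkContext

sc = SparkContext(appName="hbase-save-sketch")

# Hypothetical records: (rowkey, column_family, qualifier, value).
processed = sc.parallelize([
    ("row1", "f1", "q1", "v1"),
    ("row2", "f1", "q1", "v2"),
])

conf = {
    "hbase.zookeeper.quorum": "zk-host",          # placeholder
    "hbase.mapred.outputtable": "output_table",   # placeholder
    "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
    "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable",
}
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

# Each record becomes (rowkey, [rowkey, family, qualifier, value]).
processed.map(lambda x: (x[0], list(x))).saveAsNewAPIHadoopDataset(
    conf=conf, keyConverter=keyConv, valueConverter=valueConv)

sc.stop()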

large data set to get rid of exceeds Integer.MAX_VALUE error

2015-01-26 Thread freedafeng
Hi, this seems to be a known issue (see here: http://apache-spark-user-list.1001560.n3.nabble.com/ALS-failure-with-size-gt-Integer-MAX-VALUE-td19982.html). The data set is about 1.5 TB. There are 14 region servers. I am not sure how many regions there are for this data set, but very likely

Re: Testing if an RDD is empty?

2015-01-15 Thread freedafeng
I think Sampo's point is to get a function that only tests whether an RDD is empty. He does not want to know the size of the RDD, and getting the size of an RDD is expensive for large data sets. I myself have seen many times that my app threw exceptions because an empty RDD cannot be saved. This is not
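A cheap emptiness test can be built on take(1), which only scans partitions until it finds a first element, instead of counting the whole data set; a small sketch (later Spark releases added rdd.isEmpty() for exactly this):

from pyspark import SparkContext

sc = SparkContext(appName="is-empty-check")

def is_empty(rdd):
    # take(1) stops as soon as one element is found, so this avoids a
    # full count() over a large data set.
    return len(rdd.take(1)) == 0

non_empty = sc.parallelize([1, 2, 3])
empty = non_empty.filter(lambda x: x > 100)

print(is_empty(non_empty))  # False
print(is_empty(empty))      # True
sc.stop()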

how to run python app in yarn?

2015-01-14 Thread freedafeng
A CDH 5.3.0 cluster with Spark is set up; just wondering how to run a Python application on it. I used 'spark-submit --master yarn-cluster ./loadsessions.py' but got the error:

Error: Cluster deploy mode is currently not supported for python applications.
Run with --help for usage help or --verbose for

Re: how to run python app in yarn?

2015-01-14 Thread freedafeng
Got help from Marcelo and Josh. Now it is running smoothly. In case you need this info: just use yarn-client instead of yarn-cluster. Thanks, folks!

Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread freedafeng
I installed the custom build in standalone mode as normal. The master and slaves started successfully. However, I got an error when I ran a job. It seems to me from the error message that some library was compiled against hadoop1, but my Spark was compiled against hadoop2. 15/01/08 23:27:36 INFO

Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread freedafeng
I ran the release Spark in CDH 5.3.0 but got the same error. Has anyone tried to run Spark in CDH 5.3.0 using its newAPIHadoopRDD? Command:

spark-submit --master spark://master:7077 --jars /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/jars/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar

correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread freedafeng
Could anyone share your experience on how to do this? I have created a cluster and installed CDH 5.3.0 on it with basically core + HBase, but Cloudera installed and configured Spark in its parcels anyway. I'd like to install our custom Spark on this cluster to use the hadoop and hbase

spark 1.1 got error when working with cdh5.3.0 standalone mode

2015-01-07 Thread freedafeng
Hi, I installed CDH 5.3.0 core + HBase in a new EC2 cluster, then manually installed Spark 1.1 on it. But when I started the slaves, I got an error as follows:

./bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
Error: Could not find or load main class

executor logging management from python

2014-12-02 Thread freedafeng
Hi, wondering if anyone could help with this. We use an EC2 cluster to run Spark apps in standalone mode. The default log info goes to /$spark_folder/work/. This folder is on the 10 GB root filesystem, so it won't take long to fill up the whole fs. My goal is: 1. move the logging location to /mnt, where we

Re: executor logging management from python

2014-12-02 Thread freedafeng
cat spark-env.sh:

#!/usr/bin/env bash
export SPARK_WORKER_OPTS="-Dspark.executor.logs.rolling.strategy=time -Dspark.executor.logs.rolling.time.interval=daily -Dspark.executor.logs.rolling.maxRetainedFiles=3"
export SPARK_LOCAL_DIRS=/mnt/spark
export SPARK_WORKER_DIR=/mnt/spark

But

logging in workers for pyspark

2014-11-20 Thread freedafeng
Hi, I am wondering how to write logging info in a worker when running a pyspark app. I saw the thread http://apache-spark-user-list.1001560.n3.nabble.com/logging-in-pyspark-td5458.html but did not see a solution. Does anybody know a solution? Thanks!
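One commonly suggested workaround, offered here as an assumption rather than something from this thread: use Python's standard logging module inside the function that runs on the workers, so the messages end up in each executor's stderr/work directory rather than on the driver.

import logging
from pyspark import SparkContext

sc = SparkContext(appName="worker-logging-sketch")

def process(x):
    # Configure the logger lazily inside the task, so it happens on the worker.
    logger = logging.getLogger("worker")
    if not logger.handlers:
        handler = logging.StreamHandler()  # goes to the executor's stderr log
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    logger.info("processing %s", x)
    return x * 2

print(sc.parallelize(range(5)).map(process).collect())
sc.stop()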

suggest pyspark using 'with' for sparkcontext to be more 'pythonic'

2014-11-13 Thread freedafeng
It seems SparkContext is a good fit to be used with 'with' in Python; a context manager will do. Example:

with SparkContext(conf=conf, batchSize=512) as sc:
    ...

Then it is no longer necessary to write sc.stop().
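Until such support exists, a small wrapper gives the same behavior; a sketch using contextlib, with illustrative parameter values:

from contextlib import contextmanager
from pyspark import SparkContext, SparkConf

@contextmanager
def spark_context(*args, **kwargs):
    # Create the context, hand it to the caller, and always stop it.
    sc = SparkContext(*args, **kwargs)
    try:
        yield sc
    finally:
        sc.stop()

conf = SparkConf().setAppName("with-statement-example")
with spark_context(conf=conf, batchSize=512) as sc:
    print(sc.parallelize(range(10)).sum())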

Re: pyspark get column family and qualifier names from hbase table

2014-11-12 Thread freedafeng
Hi, this is my code:

import org.apache.hadoop.hbase.CellUtil

/**
 * JF: convert a Result object into a string with column family and qualifier names, something like
 * 'columnfamily1:columnqualifier1:value1;columnfamily2:columnqualifier2:value2' etc.
 * k-v pairs are separated by ';'. different

Re: pyspark get column family and qualifier names from hbase table

2014-11-12 Thread freedafeng
Hi Nick, I saw that the HBase API has gone through lots of changes. If I remember correctly, the default HBase in Spark 1.1.0 is 0.94.6; the one I am using is 0.98.1. To get the column family names and qualifier names, we need to call different methods for these two versions. I don't know how

pyspark get column family and qualifier names from hbase table

2014-11-11 Thread freedafeng
Hello there, I am wondering how to get the column family names and column qualifier names when using pyspark to read an HBase table with multiple column families. I have an HBase table as follows:

hbase(main):007:0> scan 'data1'
ROW            COLUMN+CELL
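For reference, the pyspark side of such a read typically follows the hbase_inputformat.py example shipped with Spark; the ZooKeeper host is a placeholder, and the spark-examples jar (which provides the converter classes) needs to be on the classpath, e.g. via --driver-class-path. A custom valueConverter, like the Scala class discussed above in this thread, can emit 'family:qualifier:value' strings instead of just the cell value.

from pyspark import SparkContext

sc = SparkContext(appName="hbase-read-sketch")

conf = {
    "hbase.zookeeper.quorum": "zk-host",   # placeholder
    "hbase.mapreduce.inputtable": "data1",
}
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"

rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=conf)

print(rdd.take(3))
sc.stop()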

Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread freedafeng
Checked the source and found the following:

class HBaseResultToStringConverter extends Converter[Any, String] {
  override def convert(obj: Any): String = {
    val result = obj.asInstanceOf[Result]
    Bytes.toStringBinary(result.value())
  }
}

I feel using 'result.value()' here is a big

Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread freedafeng
Just wrote a custom converter in Scala to replace HBaseResultToStringConverter. Just a couple of lines of code.

Re: How to ship cython library to workers?

2014-11-04 Thread freedafeng
Thanks for the solution! I did figure out how to create an .egg file to ship out to the workers. Using IPython seems to be another cool solution.
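For anyone landing here later, a rough sketch of the egg approach; the package name 'mylib' and paths are hypothetical, and any Cython/C extension modules inside the egg must be built for the workers' platform beforehand.

# setup.py for the hypothetical package 'mylib':
#
#     from setuptools import setup, find_packages
#     setup(name="mylib", version="0.1", packages=find_packages())
#
# Build it with:  python setup.py bdist_egg   (produces dist/mylib-0.1-*.egg)

from pyspark import SparkContext

sc = SparkContext(appName="ship-egg-sketch")

# Ship the egg to every worker; it is added to the Python path there.
sc.addPyFile("dist/mylib-0.1-py2.7.egg")

def use_lib(x):
    import mylib  # import inside the task so it resolves on the worker
    return x      # stand-in for a real call into mylib

print(sc.parallelize(range(3)).map(use_lib).collect())
sc.stop()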

Re: akka connection refused bug, fix?

2014-11-03 Thread freedafeng
Does anyone have experience or advice to fix this problem? Highly appreciated!

IllegalStateException: unread block data

2014-11-03 Thread freedafeng
Hello there, just set up an EC2 cluster with no HDFS, Hadoop, or HBase whatsoever. Just installed Spark to read/process data from an HBase in a different cluster. The Spark was built against the HBase/Hadoop versions in the remote (EC2) HBase cluster, which are 0.98.1 and 2.3.0 respectively. But I

stage failure: java.lang.IllegalStateException: unread block data

2014-10-30 Thread freedafeng
Hi, I got this error when running Spark 1.1.0 to read HBase 0.98.1 through simple Python code in an EC2 cluster. The same program runs correctly in local mode, so this error only happens when running in a real cluster. Here's what I got:

14/10/30 17:51:53 INFO TaskSetManager: Starting task 0.1 in

Re: stage failure: java.lang.IllegalStateException: unread block data

2014-10-30 Thread freedafeng
The worker side has an error message like this:

14/10/30 18:29:00 INFO Worker: Asked to launch executor app-20141030182900-0006/0 for testspark_v1
14/10/30 18:29:01 INFO ExecutorRunner: Launch command: java -cp

akka connection refused bug, fix?

2014-10-30 Thread freedafeng
Hi, I saw the same issue as this thread: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-akka-connection-refused-td9864.html. Does anyone have a fix for this bug? Please! The log info on my worker node looks like:

14/10/30 20:15:18 INFO Worker: Asked to kill executor

Re: akka connection refused bug, fix?

2014-10-30 Thread freedafeng
Followed this: http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Akka-Error-while-running-Spark-Jobs/td-p/18602 but the problem was not fixed.

Re: Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?

2014-10-22 Thread freedafeng
Thanks Daniil! If I use --spark-git-repo, is there a way to specify the mvn command line parameters, like the following?

mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
mvn -Pyarn -Phadoop-2.3 -Phbase-hadoop2 -Dhadoop.version=2.3.0 -DskipTests clean package

Re: Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?

2014-10-22 Thread freedafeng
I modified the pom files in my private repo to use those parameters by default to solve the problem. But after the deployment, I found the installed version is not the customized version, but an official one. Could anyone please give a hint on how spark-ec2 works with Spark from private repos?

stage failure: Task 0 in stage 0.0 failed 4 times

2014-10-21 Thread freedafeng
What could cause this type of 'stage failure'? Thanks! This is a simple pyspark script to list data in HBase. Command line:

./spark-submit --driver-class-path ~/spark-examples-1.1.0-hadoop2.3.0.jar /root/workspace/test/sparkhbase.py

14/10/21 17:53:50 INFO BlockManagerInfo: Added

Re: stage failure: Task 0 in stage 0.0 failed 4 times

2014-10-21 Thread freedafeng
Maybe set up an hbase.jar in the conf?

Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?

2014-10-21 Thread freedafeng
Thanks for the help! Hadoop version: 2.3.0. HBase version: 0.98.1. Using Python to read/write data from/to HBase. The only change over the official Spark 1.1.0 is the pom file under examples. Compilation (spark):

mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package

example.jar caused exception when running pi.py, spark 1.1

2014-10-20 Thread freedafeng
Created an EC2 cluster using the spark-ec2 command. If I run the pi.py example in the cluster without using the example.jar, it works. But if I add the example.jar as the driver class (something like the following), it fails with an exception. Could anyone help with this? What is the cause of the problem?

Re: example.jar caused exception when running pi.py, spark 1.1

2014-10-20 Thread freedafeng
Fixed by recompiling. Thanks.

EC2 cluster set up and access to HBase in a different cluster

2014-10-16 Thread freedafeng
The plan is to create an EC2 cluster and run (py)spark on it. Input data comes from S3; output data goes to an HBase in a persistent cluster (also EC2). My questions are: 1. I need to install some software packages on all the workers (sudo apt-get install ...). Is there a better way to do this

Re: EC2 cluster set up and access to HBase in a different cluster

2014-10-16 Thread freedafeng
Maybe I should create a private AMI for my question No. 1, assuming I use the default instance type as the base image? Has anyone tried this?

performance comparison: join vs cogroup?

2014-10-06 Thread freedafeng
For two large key-value data sets with the same set of keys, what is the fastest way to join them into one? Suppose all keys are unique within each data set, and we only care about keys that appear in both data sets. Input data I have: (k, v1) and (k, v2). Data I want to get from the
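A small sketch of the two candidates being compared; with keys unique in each data set, both yield one record per shared key, join as (k, (v1, v2)) and cogroup as (k, (iterable_of_v1, iterable_of_v2)):

from pyspark import SparkContext

sc = SparkContext(appName="join-vs-cogroup")

left = sc.parallelize([("k1", 1), ("k2", 2), ("k3", 3)])          # (k, v1)
right = sc.parallelize([("k1", "a"), ("k2", "b"), ("k3", "c")])   # (k, v2)

# join: one (k, (v1, v2)) pair per matching key.
joined = left.join(right)

# cogroup: one (k, (values1, values2)) per key; flatten to the same shape
# since each side has exactly one value per key here.
cogrouped = left.cogroup(right).mapValues(lambda vs: (list(vs[0])[0], list(vs[1])[0]))

print(sorted(joined.collect()))
print(sorted(cogrouped.collect()))
sc.stop()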

Re: Spark 1.1.0 hbase_inputformat.py not work

2014-09-23 Thread freedafeng
I don't know if it's relevant, but I had to compile Spark for my specific HBase and Hadoop versions to make that hbase_inputformat.py work.

spark 1.1 examples build failure on cdh 5.1

2014-09-18 Thread freedafeng
This is an mvn build.

[ERROR] Failed to execute goal on project spark-examples_2.10: Could not resolve dependencies for project org.apache.spark:spark-examples_2.10:jar:1.1.0: Could not find artifact org.apache.hbase:hbase:jar:0.98.1 in central (https://repo1.maven.org/maven2) - [Help 1]
[ERROR]

request to merge the pull request #1893 to master

2014-09-18 Thread freedafeng
We are working on a project that needs Python + Spark to work on HDFS and HBase data. We would like to use a not-too-old version of HBase such as HBase 0.98.x. We have tried many different ways (and platforms) to compile and test the Spark 1.1 official release, but got all sorts of issues. The only version

How to ship cython library to workers?

2014-09-17 Thread freedafeng
I have a library written in Cython and C. Wondering if it can be shipped to the workers, which don't have Cython installed. Maybe create an egg package from this library? How?

spark 1.1 failure. class conflict?

2014-09-12 Thread freedafeng
Newbie with Java, so please be specific on how to resolve this. The command I was running is:

$ ./spark-submit --driver-class-path /home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/lib/spark-examples-1.1.0-hadoop2.3.0.jar

Re: spark 1.1 failure. class conflict?

2014-09-12 Thread freedafeng
The same command passed in another QuickStart VM (v4.7), which has HBase 0.96 installed. Maybe there are some conflicts between the newer HBase version and Spark 1.1.0? Just my guess. Thanks.

Re: how to run python examples in spark 1.1?

2014-09-10 Thread freedafeng
Just want to provide more information on how I ran the examples. Environment: Cloudera QuickStart VM 5.1.0 (HBase 0.98.1 installed). I created a table called 'data1' and 'put' two records in it. I can see the table and data are fine in the hbase shell. I cloned the Spark repo and checked out 1.1

how to run python examples in spark 1.1?

2014-09-09 Thread freedafeng
I'm mostly interested in the HBase examples in the repo. I saw two examples, hbase_inputformat.py and hbase_outputformat.py, in the 1.1 branch. Can you show me how to run them? The compile step is done. I tried to run the examples but failed.