Re: Choice of IDE for Spark

2021-09-30 Thread Jeff Zhang
IIRC, you want an IDE for PySpark on YARN?

Mich Talebzadeh  于2021年9月30日周四 下午7:00写道:

> Hi,
>
> This may look like a redundant question, but it comes about because of the
> advent of cloud workstations such as Amazon WorkSpaces and others.
>
> With IntelliJ you are OK with Spark & Scala. With PyCharm you are fine
> with PySpark and a virtual environment. Mind you, as far as I know PyCharm
> only executes spark-submit in local mode; for YARN, one needs to open a
> terminal and submit from there.
>
> However, on an Amazon workstation you get Visual Studio Code
> <https://code.visualstudio.com/> (VSC, an MS product) and OpenOffice
> installed. With VSC you get tooling for working with JSON files, but I am
> not sure whether, with a plugin for Python etc., it will be as good as
> PyCharm. Has anyone used VSC in anger for Spark and, if so, what was the
> experience?
>
> Thanks
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


-- 
Best Regards

Jeff Zhang


Re: Unable to use scala function in pyspark

2021-09-26 Thread Jeff Zhang
You can first try it via docker
http://zeppelin.apache.org/download.html#using-the-official-docker-image
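For example, a minimal sketch of starting the official image (the image name
and tag follow the Zeppelin download page; adjust the version as needed):

  # start Zeppelin in Docker and open http://localhost:8080
  docker run -p 8080:8080 --rm --name zeppelin apache/zeppelin:0.10.0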


Jeff Zhang  于2021年9月27日周一 上午6:49写道:

> Hi kumar,
>
> You can try Zeppelin, which supports UDF sharing across languages
>
> http://zeppelin.apache.org/
>
>
>
>
> rahul kumar  于2021年9月27日周一 上午4:20写道:
>
>> I'm trying to use a function defined in scala jar in pyspark ( spark
>> 3.0.2).
>>
>> --scala ---
>>
>> Object PythonUtil {
>>
>> def customedf(dataFrame: DataFrame,
>>  keyCol: String,
>>  table: String,
>>  outputSchema: StructType,
>>  database: String): DataFrame = {
>>
>> // some transformation of dataframe and convert as per the output schema
>> types and fields.
>> ...
>> resultDF
>> }
>>
>> //In jupyter notebook
>> schema creation:
>> alias = StructType([StructField("first_name", StringType(),
>> False),StructField("last_name", StringType(), False)])
>> name = StructType([StructField("first_name", StringType(),
>> False),StructField("aliases", ArrayType(alias), False)])
>> street_adress = StructType([StructField("street_name", StringType(),
>> False),StructField("apt_number", IntegerType(), False)])
>> address = StructType([StructField("zip", LongType(),
>> False),StructField("street", street_adress, False),StructField("city",
>> StringType(), False)])
>> workHistory = StructType([StructField("company_name", StringType(),
>> False),StructField("company_address", address,
>> False),StructField("worked_from", StringType(), False)])
>>
>> //passing this to scala function.
>> outputschema= StructType([StructField("name", name,
>> False),StructField("SSN", StringType(), False),StructField("home_address",
>> ArrayType(address), False)])
>>
>> ssns = [["825-55-3247"], ["289-18-1554"], ["756-46-4088"],
>> ["525-31-0299"], ["456-45-2200"], ["200-71-7765"]]
>> customerIdsDF=spark.createDataFrame(ssns,["SSN"])
>>
>> scala2_object= sc._jvm.com.mytest.spark.PythonUtil
>> pyspark.sql.DataFrame(scala2_object.customdf(customerIdsDF._jdf, 'SSN',
>> 'table', outputschema, 'test'), spark._wrapped)
>>
>> Then I get an error that AttributeError: 'StructField' object has no
>> attribute '_get_object_id'
>>
>> full stacktrace
>>
>> ---
>> AttributeErrorTraceback (most recent call
>> last)
>>  in 
>>   4
>>   5 scala2_object= sc._jvm.com.aerospike.spark.PythonUtil
>> > 6 pyspark.sql.DataFrame(scala2_object.customdf(customerIdsDF._jdf,
>> 'SSN', 'table',smallSchema, 'test'), spark._wrapped)
>>
>> ~/.sdkman/candidates/spark/3.0.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py
>> in __call__(self, *args)
>>1294
>>1295 def __call__(self, *args):
>> -> 1296 args_command, temp_args = self._build_args(*args)
>>1297
>>1298 command = proto.CALL_COMMAND_NAME +\
>>
>> ~/.sdkman/candidates/spark/3.0.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py
>> in _build_args(self, *args)
>>1258 def _build_args(self, *args):
>>1259 if self.converters is not None and len(self.converters) >
>> 0:
>> -> 1260 (new_args, temp_args) = self._get_args(args)
>>1261 else:
>>1262 new_args = args
>>
>> ~/.sdkman/candidates/spark/3.0.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py
>> in _get_args(self, args)
>>1245 for converter in self.gateway_client.converters:
>>1246 if converter.can_convert(arg):
>> -> 1247 temp_arg = converter.convert(arg,
>> self.gateway_client)
>>1248 temp_args.append(temp_arg)
>>1249 new_args.append(temp_arg)
>>
>> ~/.sdkman/candidates/spark/3.0.2/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py
>> in convert(self, object, gateway_client)
>> 509 java_list = ArrayList()
>> 510 for element in object:
>> --> 511 java_list.add(element)
>> 512 return java_list
>> 513
>>
>> ~/.sdkman/candidates/spark/3.0.2/python/lib/py4j-0.10.9-src.z

Re: Unable to use scala function in pyspark

2021-09-26 Thread Jeff Zhang
_command, temp_args
>
> ~/.sdkman/candidates/spark/3.0.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py
> in (.0)
>1264
>1265 args_command = "".join(
> -> 1266 [get_command_part(arg, self.pool) for arg in new_args])
>1267
>1268 return args_command, temp_args
>
> ~/.sdkman/candidates/spark/3.0.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py
> in get_command_part(parameter, python_proxy_pool)
> 296 command_part += ";" + interface
> 297 else:
> --> 298 command_part = REFERENCE_TYPE + parameter._get_object_id()
> 299
> 300 command_part += "\n"
>
>
>
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 
Best Regards

Jeff Zhang


[ANNOUNCE] Apache Zeppelin 0.10.0 is released, Spark on Zeppelin Improved

2021-08-26 Thread Jeff Zhang
Hi Spark users,

We (the Zeppelin community) are very excited to announce that Apache Zeppelin
notebook 0.10.0 has officially been released. Here are the main features of
Spark on Zeppelin:

   - Support multiple versions of Spark - You can run different versions of
   Spark in one Zeppelin instance
   - Support multiple versions of Scala - You can run different Scala
   versions (2.10/2.11/2.12) of Spark in one Zeppelin instance
   - Support multiple languages - Scala, SQL, Python and R are supported;
   besides that, you can also collaborate across languages, e.g. you can write
   a Scala UDF and use it in PySpark
   - Support multiple execution modes - Local | Standalone | Yarn | K8s
   - Interactive development - The interactive development user experience
   increases your productivity
   - Inline Visualization - You can visualize a Spark Dataset/DataFrame via
   Python/R's plotting libraries, and you can even build a SparkR Shiny app in
   Zeppelin
   - Multi-tenancy - Multiple users can work in one Zeppelin instance
   without affecting each other.
   - Rest API Support - You can not only submit Spark jobs via the Zeppelin
   notebook UI, but also via its REST API (you can use Zeppelin as a
   Spark job server).

The easiest way to try Zeppelin is via its Docker container; check out this
link for how to use Spark in the Zeppelin Docker container:
https://zeppelin.apache.org/docs/0.10.0/interpreter/spark.html#play-spark-in-zeppelin-docker

Spark on Zeppelin doc:
https://zeppelin.apache.org/docs/0.10.0/interpreter/spark.html
Download it here: https://zeppelin.apache.org/download.html

-- 
Best Regards

Jeff Zhang
Twitter: zjffdu


Re: Databricks notebook - cluster taking a long time to get created, often timing out

2021-08-17 Thread Jeff Zhang
Maybe you can try the zeppelin notebook. http://zeppelin.apache.org/


karan alang  于2021年8月17日周二 下午2:11写道:

> Hello - I've been using the Databricks notebook (for pyspark or scala/spark
> development), and recently have had issues wherein the cluster takes a long
> time to get created, often timing out.
>
> Any ideas on how to resolve this ?
> Any other alternatives to databricks notebook ?
>
>

-- 
Best Regards

Jeff Zhang


Is the pandas version in doc of using pyarrow in spark wrong

2021-08-09 Thread Jeff Zhang
The doc says that the minimum supported pandas version is 0.23.2, which is
only supported in Python 2.
IIRC, Python 2 has not been supported in PySpark for a long time. Can anyone
confirm whether the doc is wrong, and what the correct versions of pandas
and pyarrow are?

https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html#recommended-pandas-and-pyarrow-versions

-- 
Best Regards

Jeff Zhang


Re: Fail to use SparkR of 3.0 preview 2

2019-12-26 Thread Jeff Zhang
Yes, I guess so. But R 3.6.2 was just released this month; I think we should
use an older version of R to build SparkR.

Felix Cheung  于2019年12月27日周五 上午10:43写道:

> Maybe it’s the reverse - the package is built to run in latest but not
> compatible with slightly older (3.5.2 was Dec 2018)
>
> --
> *From:* Jeff Zhang 
> *Sent:* Thursday, December 26, 2019 5:36:50 PM
> *To:* Felix Cheung 
> *Cc:* user.spark 
> *Subject:* Re: Fail to use SparkR of 3.0 preview 2
>
> I use R 3.5.2
>
> Felix Cheung  于2019年12月27日周五 上午4:32写道:
>
> It looks like a change in the method signature in R base packages.
>
> Which version of R are you running on?
>
> --
> *From:* Jeff Zhang 
> *Sent:* Thursday, December 26, 2019 12:46:12 AM
> *To:* user.spark 
> *Subject:* Fail to use SparkR of 3.0 preview 2
>
> I tried SparkR of spark 3.0 preview 2, but hit the following issue.
>
> Error in rbind(info, getNamespaceInfo(env, "S3methods")) :
>   number of columns of matrices must match (see arg 2)
> Error: package or namespace load failed for ‘SparkR’ in rbind(info,
> getNamespaceInfo(env, "S3methods")):
>  number of columns of matrices must match (see arg 2)
> During startup - Warning messages:
> 1: package ‘SparkR’ was built under R version 3.6.2
> 2: package ‘SparkR’ in options("defaultPackages") was not found
>
> Does anyone know what might be wrong ? Thanks
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


-- 
Best Regards

Jeff Zhang


Re: Fail to use SparkR of 3.0 preview 2

2019-12-26 Thread Jeff Zhang
I use R 3.5.2

Felix Cheung  于2019年12月27日周五 上午4:32写道:

> It looks like a change in the method signature in R base packages.
>
> Which version of R are you running on?
>
> --
> *From:* Jeff Zhang 
> *Sent:* Thursday, December 26, 2019 12:46:12 AM
> *To:* user.spark 
> *Subject:* Fail to use SparkR of 3.0 preview 2
>
> I tried SparkR of spark 3.0 preview 2, but hit the following issue.
>
> Error in rbind(info, getNamespaceInfo(env, "S3methods")) :
>   number of columns of matrices must match (see arg 2)
> Error: package or namespace load failed for ‘SparkR’ in rbind(info,
> getNamespaceInfo(env, "S3methods")):
>  number of columns of matrices must match (see arg 2)
> During startup - Warning messages:
> 1: package ‘SparkR’ was built under R version 3.6.2
> 2: package ‘SparkR’ in options("defaultPackages") was not found
>
> Does anyone know what might be wrong ? Thanks
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


-- 
Best Regards

Jeff Zhang


Fail to use SparkR of 3.0 preview 2

2019-12-26 Thread Jeff Zhang
I tried SparkR of spark 3.0 preview 2, but hit the following issue.

Error in rbind(info, getNamespaceInfo(env, "S3methods")) :
  number of columns of matrices must match (see arg 2)
Error: package or namespace load failed for ‘SparkR’ in rbind(info,
getNamespaceInfo(env, "S3methods")):
 number of columns of matrices must match (see arg 2)
During startup - Warning messages:
1: package ‘SparkR’ was built under R version 3.6.2
2: package ‘SparkR’ in options("defaultPackages") was not found

Does anyone know what might be wrong ? Thanks



-- 
Best Regards

Jeff Zhang


Re: Spark job fails because of timeout to Driver

2019-10-04 Thread Jeff Zhang
You can try increasing the property spark.yarn.am.waitTime (by default it is
100s).
Maybe you are doing some very time-consuming operation when initializing the
SparkContext, which causes the timeout.

See this property here
http://spark.apache.org/docs/latest/running-on-yarn.html
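For example, a rough sketch of passing a larger value on submit (the value,
class name and jar are placeholders for illustration only):

  spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.am.waitTime=300s \
    --class com.example.MyApp myapp.jar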


Jochen Hebbrecht  于2019年10月4日周五 下午10:08写道:

> Hi,
>
> I'm using Spark 2.4.2 on AWS EMR 5.24.0. I'm trying to send a Spark job
> towards the cluster. Thhe job gets accepted, but the YARN application fails
> with:
>
>
> {code}
> 19/09/27 14:33:35 ERROR ApplicationMaster: Uncaught exception:
> java.util.concurrent.TimeoutException: Futures timed out after [10
> milliseconds]
> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
> at org.apache.spark.deploy.yarn.ApplicationMaster.org
> $apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> 19/09/27 14:33:35 INFO ApplicationMaster: Final app status: FAILED,
> exitCode: 13, (reason: Uncaught exception:
> java.util.concurrent.TimeoutException: Futures timed out after [10
> milliseconds]
> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
> at org.apache.spark.deploy.yarn.ApplicationMaster.org
> $apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> {code}
>
> It actually goes wrong at this line:
> https://github.com/apache/spark/blob/v2.4.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L468
>
> Now, I'm 100% sure Spark is OK and there's no bug, but there must be
> something wrong with my setup. I don't understand the code of the
> ApplicationMaster, so could somebody explain me what it is trying to reach?
> Where exactly does the connection timeout? So at least I can debug it
> further because I don't have a clue what it is doing :-)
>
> Thanks for any help!
> Jochen
>


-- 
Best Regards

Jeff Zhang


Re: [spark on yarn] spark on yarn without DFS

2019-05-19 Thread Jeff Zhang
I am afraid not, because YARN needs a DFS.

Huizhe Wang  于2019年5月20日周一 上午9:50写道:

> Hi,
>
> I want to use Spark on YARN without HDFS. I store my resources in AWS and
> use s3a to get them. However, when I used stop-dfs.sh it stopped the
> NameNode and DataNode, and I got an error when using yarn cluster mode.
> Can I use YARN without starting DFS, and how would I use this mode?
>
> Yours,
> Jane
>


-- 
Best Regards

Jeff Zhang


Re: Best notebook for developing for apache spark using scala on Amazon EMR Cluster

2019-05-01 Thread Jeff Zhang
You can configure zeppelin to store your notes in S3

http://zeppelin.apache.org/docs/0.8.1/setup/storage/storage.html#notebook-storage-in-s3
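A minimal sketch of what that might look like in conf/zeppelin-env.sh (the
bucket and user names are placeholders; see the linked doc for the
authoritative list of settings and the credential setup):

  export ZEPPELIN_NOTEBOOK_S3_BUCKET=my-zeppelin-bucket
  export ZEPPELIN_NOTEBOOK_S3_USER=zeppelin-user
  export ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.S3NotebookRepo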





V0lleyBallJunki3  于2019年5月1日周三 上午5:26写道:

> Hello. I am using Zeppelin on an Amazon EMR cluster while developing Apache
> Spark programs in Scala. The problem is that once that cluster is destroyed,
> I lose all the notebooks on it. So over a period of time I have a lot of
> notebooks that have to be manually exported to my local disk and from
> there imported into each new EMR cluster I create. Is there a notebook
> repository or tool that I can use where I can keep all my notebooks in a
> folder and access them even on new EMR clusters? I know Jupyter is there,
> but it doesn't support auto-complete for Scala.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 
Best Regards

Jeff Zhang


Re: [ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-15 Thread Jeff Zhang
Congrats, Great work Dongjoon.



Dongjoon Hyun  于2019年1月15日周二 下午3:47写道:

> We are happy to announce the availability of Spark 2.2.3!
>
> Apache Spark 2.2.3 is a maintenance release, based on the branch-2.2
> maintenance branch of Spark. We strongly recommend all 2.2.x users to
> upgrade to this stable release.
>
> To download Spark 2.2.3, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-2-2-3.html
>
> We would like to acknowledge all community members for contributing to
> this release. This release would not have been possible without you.
>
> Bests,
> Dongjoon.
>


-- 
Best Regards

Jeff Zhang


Re: Run/install tensorframes on zeppelin pyspark

2018-08-08 Thread Jeff Zhang
Make sure you use the correct Python, i.e. one that has tensorframes
installed. Use PYSPARK_PYTHON to configure which Python is used.
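For example, a sketch in conf/zeppelin-env.sh (the path is a placeholder for
whichever Python actually has tensorframes installed):

  export PYSPARK_PYTHON=/opt/conda/envs/tf/bin/python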



Spico Florin 于2018年8月8日周三 下午9:59写道:

> Hi!
>
> I would like to use tensorframes in my pyspark notebook.
>
> I have performed the following:
>
> 1. In the Spark interpreter, added a new repository
> http://dl.bintray.com/spark-packages/maven
> 2. In the Spark interpreter, added the
> dependency databricks:tensorframes:0.2.9-s_2.11
> 3. pip install tensorframes
>
>
> In both 0.7.3 and 0.8.0:
> 1.  the following code resulted in error: "ImportError: No module named
> tensorframes"
>
> %pyspark
> import tensorframes as tfs
>
> 2. the following code succeeded
> %spark
> import org.tensorframes.{dsl => tf}
> import org.tensorframes.dsl.Implicits._
> val df = spark.createDataFrame(Seq(1.0->1.1, 2.0->2.2)).toDF("a", "b")
>
> // As in Python, scoping is recommended to prevent name collisions.
> val df2 = tf.withGraph {
> val a = df.block("a")
> // Unlike python, the scala syntax is more flexible:
> val out = a + 3.0 named "out"
> // The 'mapBlocks' method is added using implicits to dataframes.
> df.mapBlocks(out).select("a", "out")
> }
>
> // The transform is all lazy at this point, let's execute it with collect:
> df2.collect()
>
> I ran the code above directly with spark interpreter with the default
> configurations (master set up to local[*] - so not via spark-submit
> command) .
>
> Also, I have installed spark home locally and ran the command
>
> $SPARK_HOME/bin/pyspark --packages databricks:tensorframes:0.2.9-s_2.11
>
> and the code below worked as expcted
>
> import tensorframes as tfs
>
>  Can you please help to solve this?
>
> Thanks,
>
>  Florin
>
>
>
>
>
>
>
>
>


Re: Spark YARN Error - triggering spark-shell

2018-06-08 Thread Jeff Zhang
Check the yarn AM log for details.
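For example (the application id is a placeholder):

  yarn logs -applicationId application_1528443386000_0001 | less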



Aakash Basu 于2018年6月8日周五 下午4:36写道:

> Hi,
>
> Getting this error when trying to run Spark Shell using YARN -
>
> Command: *spark-shell --master yarn --deploy-mode client*
>
> 2018-06-08 13:39:09 WARN  Client:66 - Neither spark.yarn.jars nor 
> spark.yarn.archive is set, falling back to uploading libraries under 
> SPARK_HOME.
> 2018-06-08 13:39:25 ERROR SparkContext:91 - Error initializing SparkContext.
> org.apache.spark.SparkException: Yarn application has already ended! It might 
> have been killed or unable to launch application master.
>
>
> The last half of stack-trace -
>
> 2018-06-08 13:56:11 WARN  YarnSchedulerBackend$YarnSchedulerEndpoint:66 - 
> Attempted to request executors before the AM has registered!
> 2018-06-08 13:56:11 WARN  MetricsSystem:66 - Stopping a MetricsSystem that is 
> not running
> org.apache.spark.SparkException: Yarn application has already ended! It might 
> have been killed or unable to launch application master.
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:89)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
>   at org.apache.spark.SparkContext.(SparkContext.scala:500)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2486)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:103)
>   ... 55 elided
> :14: error: not found: value spark
>import spark.implicits._
>   ^
> :14: error: not found: value spark
>import spark.sql
>
>
> Tried putting the *spark-yarn_2.11-2.3.0.jar *in Hadoop yarn, still not
> working, anything else to do?
>
> Thanks,
> Aakash.
>


Re: Livy Failed error on Yarn with Spark

2018-05-24 Thread Jeff Zhang
Could you check the Spark app's YARN log and the Livy log?


Chetan Khatri 于2018年5月10日周四 上午4:18写道:

> All,
>
> I am running on Hortonworks HDP Hadoop with Livy and Spark 2.2.0, when I
> am running same spark job using spark-submit it is getting success with all
> transformations are done.
>
> When I am trying to do spark submit using Livy, at that time Spark Job is
> getting invoked and getting success but Yarn status says : FAILED and when
> you take a look on logs at attempt : Log says  SUCCESS and there is no
> error log.
>
> Any one has faced this weird exprience ?
>
> Thank you.
>


Re: [Spark] Supporting python 3.5?

2018-05-24 Thread Jeff Zhang
It supports Python 3.5, and IIRC Spark also supports Python 3.6.

Irving Duran 于2018年5月10日周四 下午9:08写道:

> Does spark now support python 3.5 or it is just 3.4.x?
>
> https://spark.apache.org/docs/latest/rdd-programming-guide.html
>
> Thank You,
>
> Irving Duran
>


Re: Spark on YARN in client-mode: do we need 1 vCore for the AM?

2018-05-24 Thread Jeff Zhang
I don't think it is possible to have less than 1 core for the AM; this is due
to YARN, not Spark.

The number of AMs compared to the number of executors should be small and
acceptable. If you do want to save more resources, I would suggest you use
yarn-cluster mode, where the driver and AM run in the same process.

You can use either Livy or Zeppelin, which both support interactive work in
yarn-cluster mode.

http://livy.incubator.apache.org/
https://zeppelin.apache.org/
https://medium.com/@zjffdu/zeppelin-0-8-0-new-features-ea53e8810235


Another approach to save resources is to share a SparkContext across your
applications, since your scenario is interactive work (I guess it is some
kind of notebook).  Zeppelin supports sharing a SparkContext across users and
notes.



peay 于2018年5月18日周五 下午6:20写道:

> Hello,
>
> I run a Spark cluster on YARN, and we have a bunch of client-mode
> applications we use for interactive work. Whenever we start one of this, an
> application master container is started.
>
> My understanding is that this is mostly an empty shell, used to request
> further containers or get status from YARN. Is that correct?
>
> spark.yarn.am.cores is 1, and that AM gets one full vCore on the cluster.
> Because I am using DominantResourceCalculator to take vCores into account
> for scheduling, this results in a lot of unused CPU capacity overall
> because all those AMs each block one full vCore. With enough jobs, this
> adds up quickly.
>
> I am trying to understand if we can work around that -- ideally, by
> allocating fractional vCores (e.g., give 100 millicores to the AM), or by
> allocating no vCores at all for the AM (I am fine with a bit of
> oversubscription because of that).
>
> Any idea on how to avoid blocking so many YARN vCores just for the Spark
> AMs?
>
> Thanks!
>
>


Re: spark-submit can find python?

2018-01-15 Thread Jeff Zhang
Hi Manuel,

Looks like you are using Spark's virtualenv support. virtualenv will create
the Python environment on each executor.

>>> --conf 
>>> spark.pyspark.virtualenv.bin.path=/home/mansop/hail-test/python-2.7.2/bin/activate
\
And your configuration is not correct: spark.pyspark.virtualenv.bin.path
should point to the virtualenv executable file, which needs to be installed
on all the nodes of the cluster. You can check the following link for more
details on how to use virtualenv in PySpark.

https://community.hortonworks.com/articles/104949/using-virtualenv-with-pyspark-1.html
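For example, a sketch of the corrected submit command based on the options in
this thread (/usr/bin/virtualenv is a placeholder for wherever the virtualenv
executable is installed on every node):

  spark-submit --master yarn --deploy-mode cluster \
    --conf spark.pyspark.virtualenv.enabled=true \
    --conf spark.pyspark.virtualenv.type=native \
    --conf spark.pyspark.virtualenv.requirements=/home/mansop/requirements.txt \
    --conf spark.pyspark.virtualenv.bin.path=/usr/bin/virtualenv \
    test.py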



Manuel Sopena Ballesteros 于2018年1月16日周二 上午8:02写道:

> Apologies, I copied the wrong spark-submit output from running in a
> cluster. Please find below the right output for the question asked:
>
>
>
> -bash-4.1$ spark-submit --master yarn \
>
> > --deploy-mode cluster \
>
> > --driver-memory 4g \
>
> > --executor-memory 2g \
>
> > --executor-cores 4 \
>
> > --queue default \
>
> > --conf spark.pyspark.virtualenv.enabled=true \
>
> > --conf spark.pyspark.virtualenv.type=native \
>
> > --conf
> spark.pyspark.virtualenv.requirements=/home/mansop/requirements.txt \
>
> > --conf
> spark.pyspark.virtualenv.bin.path=/home/mansop/hail-test/python-2.7.2/bin/activate
> \
>
> > --jars $HAIL_HOME/build/libs/hail-all-spark.jar \
>
> > --py-files $HAIL_HOME/build/distributions/hail-python.zip \
>
> > test.py
>
>
>
> 18/01/16 10:42:49 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> 18/01/16 10:42:50 WARN DomainSocketFactory: The short-circuit local reads
> feature cannot be used because libhadoop cannot be loaded.
>
> 18/01/16 10:42:50 INFO RMProxy: Connecting to ResourceManager at
> wp-hdp-ctrl03-mlx.mlx/10.0.1.206:8050
>
> 18/01/16 10:42:50 INFO Client: Requesting a new application from cluster
> with 4 NodeManagers
>
> 18/01/16 10:42:50 INFO Client: Verifying our application has not requested
> more than the maximum memory capability of the cluster (450560 MB per
> container)
>
> 18/01/16 10:42:50 INFO Client: Will allocate AM container, with 4505 MB
> memory including 409 MB overhead
>
> 18/01/16 10:42:50 INFO Client: Setting up container launch context for our
> AM
>
> 18/01/16 10:42:50 INFO Client: Setting up the launch environment for our
> AM container
>
> 18/01/16 10:42:50 INFO Client: Preparing resources for our AM container
>
> 18/01/16 10:42:51 INFO Client: Use hdfs cache file as spark.yarn.archive
> for HDP,
> hdfsCacheFile:hdfs://wp-hdp-ctrl01-mlx.mlx:8020/hdp/apps/2.6.3.0-235/spark2/spark2-hdp-yarn-archive.tar.gz
>
> 18/01/16 10:42:51 INFO Client: Source and destination file systems are the
> same. Not copying
> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/hdp/apps/2.6.3.0-235/spark2/spark2-hdp-yarn-archive.tar.gz
>
> 18/01/16 10:42:51 INFO Client: Uploading resource
> file:/home/mansop/hail-test2/hail/build/libs/hail-all-spark.jar ->
> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/hail-all-spark.jar
>
> 18/01/16 10:42:51 INFO Client: Uploading resource
> file:/home/mansop/requirements.txt ->
> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/requirements.txt
>
> 18/01/16 10:42:51 INFO Client: Uploading resource
> file:/home/mansop/test.py ->
> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/test.py
>
> 18/01/16 10:42:51 INFO Client: Uploading resource
> file:/usr/hdp/2.6.3.0-235/spark2/python/lib/pyspark.zip ->
> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/pyspark.zip
>
> 18/01/16 10:42:51 INFO Client: Uploading resource
> file:/usr/hdp/2.6.3.0-235/spark2/python/lib/py4j-0.10.4-src.zip ->
> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/py4j-0.10.4-src.zip
>
> 18/01/16 10:42:51 INFO Client: Uploading resource
> file:/home/mansop/hail-test2/hail/build/distributions/hail-python.zip ->
> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/hail-python.zip
>
> 18/01/16 10:42:52 INFO Client: Uploading resource
> file:/tmp/spark-592e7e0f-6faa-4c3c-ab0f-7dd1cff21d17/__spark_conf__8493747840734310444.zip
> ->
> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/__spark_conf__.zip
>
> 18/01/16 10:42:52 INFO SecurityManager: Changing view acls to: mansop
>
> 18/01/16 10:42:52 INFO SecurityManager: Changing modify acls to: mansop
>
> 18/01/16 10:42:52 INFO SecurityManager: Changing view acls groups to:
>
> 18/01/16 10:42:52 INFO SecurityManager: Changing modify acls groups to:
>
> 18/01/16 10:42:52 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users  with view permissions: Set(mansop);
> groups with view permissions: Set(); users  with modify permissions:
> Set(mansop); groups 

Re: PIG to Spark

2018-01-08 Thread Jeff Zhang
Pig supports the Spark engine now, so you can leverage Spark execution with
your Pig scripts.

I am afraid there's no solution to convert a Pig script into Spark API code.
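For example, assuming a Pig version that ships the Spark engine (0.17+), a
script can be run on Spark roughly like this (script name is a placeholder):

  pig -x spark myscript.pig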





Pralabh Kumar 于2018年1月8日周一 下午11:25写道:

> Hi
>
> Is there a convenient way /open source project to convert PIG scripts to
> Spark.
>
>
> Regards
> Pralabh Kumar
>


Re: pyspark configuration with Juyter

2017-11-03 Thread Jeff Zhang
You are setting PYSPARK_DRIVER_PYTHON to jupyter; please set it to the Python
executable instead.
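For example, a sketch that launches the plain PySpark shell instead of a
Jupyter server (paths assume you run it from SPARK_HOME):

  export PYSPARK_PYTHON=python
  unset PYSPARK_DRIVER_PYTHON
  unset PYSPARK_DRIVER_PYTHON_OPTS
  ./bin/pyspark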


anudeep 于2017年11月3日周五 下午7:31写道:

> Hello experts,
>
> I installed Jupyter notebook through Anaconda and set the PySpark driver to
> use Jupyter notebook.
>
> I see the below issue when I try to open pyspark.
>
> anudeepg@datanode2 spark-2.1.0]$ ./bin/pyspark
> [I 07:29:53.184 NotebookApp] The port  is already in use, trying
> another port.
> [I 07:29:53.211 NotebookApp] JupyterLab alpha preview extension loaded
> from /home/anudeepg/anaconda2/lib/python2.7/site-packages/jupyterlab
> JupyterLab v0.27.0
> Known labextensions:
> [I 07:29:53.212 NotebookApp] Running the core application with no
> additional extensions or settings
> [I 07:29:53.214 NotebookApp] Serving notebooks from local directory:
> /opt/mapr/spark/spark-2.1.0
> [I 07:29:53.214 NotebookApp] 0 active kernels
> [I 07:29:53.214 NotebookApp] The Jupyter Notebook is running at:
> http://localhost:8889/?token=9aa5dc87cb5a6d987237f68e2f0b7e9c70a7f2e8c9a7cf2e
> [I 07:29:53.214 NotebookApp] Use Control-C to stop this server and shut
> down all kernels (twice to skip confirmation).
> [W 07:29:53.214 NotebookApp] No web browser found: could not locate
> runnable browser.
> [C 07:29:53.214 NotebookApp]
>
> Copy/paste this URL into your browser when you connect for the first
> time,
> to login with a token:
>
> http://localhost:8889/?token=9aa5dc87cb5a6d987237f68e2f0b7e9c70a7f2e8c9a7cf2e
>
>
> Can someone please help me here.
>
> Thanks!
> Anudeep
>
>


Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Jeff Zhang
Awesome !

Hyukjin Kwon 于2017年7月13日周四 上午8:48写道:

> Cool!
>
> 2017-07-13 9:43 GMT+09:00 Denny Lee :
>
>> This is amazingly awesome! :)
>>
>> On Wed, Jul 12, 2017 at 13:23 lucas.g...@gmail.com 
>> wrote:
>>
>>> That's great!
>>>
>>>
>>>
>>> On 12 July 2017 at 12:41, Felix Cheung 
>>> wrote:
>>>
 Awesome! Congrats!!

 --
 *From:* holden.ka...@gmail.com  on behalf of
 Holden Karau 
 *Sent:* Wednesday, July 12, 2017 12:26:00 PM
 *To:* user@spark.apache.org
 *Subject:* With 2.2.0 PySpark is now available for pip install from
 PyPI :)

 Hi wonderful Python + Spark folks,

 I'm excited to announce that with Spark 2.2.0 we finally have PySpark
 published on PyPI (see https://pypi.python.org/pypi/pyspark /
 https://twitter.com/holdenkarau/status/885207416173756417). This has
 been a long time coming (previous releases included pip installable
 artifacts that for a variety of reasons couldn't be published to PyPI). So
 if you (or your friends) want to be able to work with PySpark locally on
 your laptop you've got an easier path getting started (pip install 
 pyspark).

 If you are setting up a standalone cluster your cluster will still need
 the "full" Spark packaging, but the pip installed PySpark should be able to
 work with YARN or an existing standalone cluster installation (of the same
 version).

 Happy Sparking y'all!

 Holden :)


 --
 Cell : 425-233-8271 <(425)%20233-8271>
 Twitter: https://twitter.com/holdenkarau

>>>
>>>
>


Re: scala test is unable to initialize spark context.

2017-04-06 Thread Jeff Zhang
Seems it is caused by your log4j configuration file:

Caused by: java.lang.IllegalStateException: FileNamePattern [-.log]
does not contain a valid date format specifier
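Whatever appender is configured there, its FileNamePattern needs a %d{...}
date specifier; the [-.log] value looks like the date part is missing. A
hypothetical log4j-extras style snippet, for illustration only:

  log4j.appender.file=org.apache.log4j.rolling.RollingFileAppender
  log4j.appender.file.rollingPolicy=org.apache.log4j.rolling.TimeBasedRollingPolicy
  log4j.appender.file.rollingPolicy.FileNamePattern=logs/test-%d{yyyy-MM-dd}.log
  log4j.appender.file.layout=org.apache.log4j.PatternLayout
  log4j.appender.file.layout.ConversionPattern=%d %p %c - %m%n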




于2017年4月6日周四 下午4:03写道:

> Hi All,
>
> I am just trying to use ScalaTest to test some small Spark code, but the
> SparkContext is not getting initialized while I am running the test file.
>
> I have included the code, POM and the exception I am getting in this mail.
> Please help me understand what mistake I am making that causes the
> SparkContext not to get initialized.
>
>
>
> Code:
>
> import org.apache.log4j.LogManager
> import org.apache.spark.SharedSparkContext
> import org.scalatest.FunSuite
> import org.apache.spark.{SparkContext, SparkConf}
>
> /**
>  * Created by PSwain on 4/5/2017.
>  */
> class Test extends FunSuite with SharedSparkContext {
>
>   test("test initializing spark context") {
>     val list = List(1, 2, 3, 4)
>     val rdd = sc.parallelize(list)
>     assert(list.length === rdd.count())
>   }
> }
>
>
>
> POM File:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <project xmlns="http://maven.apache.org/POM/4.0.0"
>          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>                              http://maven.apache.org/xsd/maven-4.0.0.xsd">
>     <modelVersion>4.0.0</modelVersion>
>
>     <groupId>tesing.loging</groupId>
>     <artifactId>logging</artifactId>
>     <version>1.0-SNAPSHOT</version>
>
>     <repositories>
>         <repository>
>             <id>central</id>
>             <name>central</name>
>             <url>http://repo1.maven.org/maven/</url>
>         </repository>
>     </repositories>
>
>     <dependencies>
>         <dependency>
>             <groupId>org.apache.spark</groupId>
>             <artifactId>spark-core_2.10</artifactId>
>             <version>1.6.0</version>
>             <type>test-jar</type>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.spark</groupId>
>             <artifactId>spark-sql_2.10</artifactId>
>             <version>1.6.0</version>
>         </dependency>
>         <dependency>
>             <groupId>org.scalatest</groupId>
>             <artifactId>scalatest_2.10</artifactId>
>             <version>2.2.6</version>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.spark</groupId>
>             <artifactId>spark-hive_2.10</artifactId>
>             <version>1.5.0</version>
>             <scope>provided</scope>
>         </dependency>
>         <dependency>
>             <groupId>com.databricks</groupId>
>             <artifactId>spark-csv_2.10</artifactId>
>             <version>1.3.0</version>
>         </dependency>
>         <dependency>
>             <groupId>com.rxcorp.bdf.logging</groupId>
>             <artifactId>loggingframework</artifactId>
>             <version>1.0-SNAPSHOT</version>
>         </dependency>
>         <dependency>
>             <groupId>mysql</groupId>
>             <artifactId>mysql-connector-java</artifactId>
>             <version>5.1.6</version>
>             <scope>provided</scope>
>         </dependency>
>         <dependency>
>             <groupId>org.scala-lang</groupId>
>             <artifactId>scala-library</artifactId>
>             <version>2.10.5</version>
>             <scope>compile</scope>
>             <optional>true</optional>
>         </dependency>
>         <dependency>
>             <groupId>org.scalatest</groupId>
>             <artifactId>scalatest</artifactId>
>             <version>1.4.RC2</version>
>         </dependency>
>         <dependency>
>             <groupId>log4j</groupId>
>             <artifactId>log4j</artifactId>
>             <version>1.2.17</version>
>         </dependency>
>         <dependency>
>             <groupId>org.scala-lang</groupId>
>             <artifactId>scala-compiler</artifactId>
>             <version>2.10.5</version>
>             <scope>compile</scope>
>             <optional>true</optional>
>         </dependency>
>     </dependencies>
>
>     <build>
>         <sourceDirectory>src/main/scala</sourceDirectory>
>         <plugins>
>             <plugin>
>                 <artifactId>maven-assembly-plugin</artifactId>
>                 <version>2.2.1</version>
>                 <configuration>
>                     <descriptorRefs>
>                         <descriptorRef>jar-with-dependencies</descriptorRef>
>                     </descriptorRefs>
>                 </configuration>
>                 <executions>
>                     <execution>
>                         <id>make-assembly</id>
>                         <phase>package</phase>
>                         <goals>
>                             <goal>single</goal>
>                         </goals>
>                     </execution>
>                 </executions>
>             </plugin>
>             <plugin>
>                 <groupId>net.alchim31.maven</groupId>
>                 <artifactId>scala-maven-plugin</artifactId>
>                 <version>3.2.0</version>
>                 <executions>
>                     <execution>
>                         <goals>
>                             <goal>compile</goal>
>                             <goal>testCompile</goal>
>                         </goals>
>                     </execution>
>                 </executions>
>                 <configuration>
>                     <sourceDir>src/main/scala</sourceDir>
>                     <jvmArgs>
>

Re: bug with PYTHONHASHSEED

2017-04-04 Thread Jeff Zhang
It is fixed in https://issues.apache.org/jira/browse/SPARK-13330
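For versions without that fix, a common workaround (a sketch; the seed value
is arbitrary and your_app.py is a placeholder) is to propagate the variable to
both the driver and the executors yourself:

  export PYTHONHASHSEED=0            # driver side
  spark-submit \
    --conf spark.executorEnv.PYTHONHASHSEED=0 \
    your_app.py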



Holden Karau 于2017年4月5日周三 上午12:03写道:

> Which version of Spark is this (or is it a dev build)? We've recently made
> some improvements with PYTHONHASHSEED propagation.
>
> On Tue, Apr 4, 2017 at 7:49 AM Eike von Seggern  cal.com> wrote:
>
> 2017-04-01 21:54 GMT+02:00 Paul Tremblay :
>
> When I try to to do a groupByKey() in my spark environment, I get the
> error described here:
>
>
> http://stackoverflow.com/questions/36798833/what-does-exception-randomness-of-hash-of-string-should-be-disabled-via-pythonh
>
> In order to attempt to fix the problem, I set up my ipython environment
> with the additional line:
>
> PYTHONHASHSEED=1
>
> When I fire up my ipython shell, and do:
>
> In [7]: hash("foo")
> Out[7]: -2457967226571033580
>
> In [8]: hash("foo")
> Out[8]: -2457967226571033580
>
> So my hash function is now seeded so it returns consistent values. But
> when I do a groupByKey(), I get the same error:
>
>
> Exception: Randomness of hash of string should be disabled via
> PYTHONHASHSEED
>
> Anyone know how to fix this problem in python 3.4?
>
>
> Independent of the python version, you have to ensure that Python on
> spark-master and -workers is started with PYTHONHASHSEED set, e.g. by
> adding it to the environment of the spark processes.
>
> Best
>
> Eike
>
> --
> Cell : 425-233-8271 <(425)%20233-8271>
> Twitter: https://twitter.com/holdenkarau
>


Re: Reply: submit spark task on yarn asynchronously via java?

2016-12-25 Thread Jeff Zhang
Or you can use Livy to submit Spark jobs:

http://livy.io/
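For example, a rough sketch of an asynchronous submit through Livy's batch
REST API (host, jar path and class name are placeholders; 8998 is Livy's
default port):

  # submit the job; the JSON response contains a batch id
  curl -s -X POST -H "Content-Type: application/json" \
    -d '{"file": "hdfs:///apps/my-spark-app.jar", "className": "com.example.MyApp"}' \
    http://livy-host:8998/batches

  # poll the state using the returned id
  curl -s http://livy-host:8998/batches/0/state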



Linyuxin 于2016年12月26日周一 上午10:32写道:

> Thanks.
>
>
>
> *From:* Naveen [mailto:hadoopst...@gmail.com]
> *Sent:* 2016-12-25 0:33
> *To:* Linyuxin
> *Cc:* user
> *Subject:* Re: Reply: submit spark task on yarn asynchronously via java?
>
>
>
> Hi,
>
> Please use the SparkLauncher API class and invoke the threads using async
> calls with Futures.
>
> Using SparkLauncher, you can specify the class name, application resource,
> arguments to be passed to the driver, deploy-mode, etc.
>
> I would suggest using Scala's Future, if Scala code is possible.
>
>
>
> https://spark.apache.org/docs/1.5.1/api/java/org/apache/spark/launcher/SparkLauncher.html
> https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Future.html
>
>
> On Fri, Dec 23, 2016 at 7:10 AM, Linyuxin  wrote:
>
> Hi,
>
> Could Anybody help?
>
>
>
> *From:* Linyuxin
> *Sent:* 2016-12-22 14:18
> *To:* user
> *Subject:* submit spark task on yarn asynchronously via java?
>
>
>
> Hi All,
>
>
>
> Version:
>
> Spark 1.5.1
>
> Hadoop 2.7.2
>
>
>
> Is there any way to submit and monitor spark task on yarn via java
> asynchronously?
>
>
>
>
>
>
>


Re: HiveContext is Serialized?

2016-10-25 Thread Jeff Zhang
In your sample code, you can use hiveContext in the foreach because it is a
Scala List foreach operation, which runs on the driver side. But you cannot
use hiveContext in RDD.foreach.
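A minimal sketch of the distinction (Spark 1.6-era API; the app name and file
name are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  val sc = new SparkContext(new SparkConf().setAppName("hc-foreach"))
  val hiveContext = new HiveContext(sc)
  val deDF = hiveContext.read.text("data_elements.txt").toDF("DataElement").distinct()

  // OK: collect() brings the rows to the driver, so this foreach is a plain
  // Scala collection operation running in the driver JVM and may use hiveContext.
  deDF.collect().foreach { row =>
    hiveContext.sql(s"SELECT '${row.getAs[String]("DataElement")}' AS data_elm").show()
  }

  // NOT OK: this foreach runs on the executors, where hiveContext is not usable.
  // deDF.rdd.foreach { row => hiveContext.sql("...") }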



Ajay Chander 于2016年10月26日周三 上午11:28写道:

> Hi Everyone,
>
> I was thinking if I can use hiveContext inside foreach like below,
>
> object Test {
>   def main(args: Array[String]): Unit = {
>
> val conf = new SparkConf()
> val sc = new SparkContext(conf)
> val hiveContext = new HiveContext(sc)
>
> val dataElementsFile = args(0)
> val deDF = 
> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>
> def calculate(de: Row) {
>   val dataElement = de.getAs[String]("DataElement").trim
>   val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
> TEST_DB.TEST_TABLE1 ")
>   df1.write.insertInto("TEST_DB.TEST_TABLE1")
> }
>
> deDF.collect().foreach(calculate)
>   }
> }
>
>
> I looked at 
> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>  and I see it is extending SqlContext which extends Logging with Serializable.
>
> Can anyone tell me if this is the right way to use it ? Thanks for your time.
>
> Regards,
>
> Ajay
>
>


Re: Using Zeppelin with Spark FP

2016-09-11 Thread Jeff Zhang
You can plot a DataFrame, but it is not supported for RDDs AFAIK.
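For example, a small sketch in a %spark paragraph, assuming Zeppelin's
ZeppelinContext (z) is available in the interpreter and the SparkSession
implicits are imported:

  %spark
  import spark.implicits._
  val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")
  z.show(df)  // renders as a table with the built-in chart options; an RDD would need .toDF() first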

On Mon, Sep 12, 2016 at 5:12 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> Zeppelin is getting better.
>
> In its description it says:
>
> [image: Inline images 1]
>
> So far so good. One feature that I have not managed to work out is
> creating plots with Spark functional programming. I can get SQL going by
> connecting to Spark thrift server and you can plot the results
>
> [image: Inline images 2]
>
> However, if I write that using functional programming I won't be able to
> plot it; the plot feature is not available.
>
> Is this correct or am I missing something?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>



-- 
Best Regards

Jeff Zhang


Re: Spark 2.0.0 Thrift Server problem with Hive metastore

2016-09-05 Thread Jeff Zhang
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
>
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>
> at org.apache.hive.service.cli.HiveSQLException.newInstance(
> HiveSQLException.java:244)
>
> at org.apache.hive.service.cli.HiveSQLException.toStackTrace(
> HiveSQLException.java:210)
>
> ... 15 more
>
> Error: Error retrieving next row (state=,code=0)
>
>
>
> The same command works when using Spark 1.6, is it a possible issue?
>
>
>
> Thanks!
>



-- 
Best Regards

Jeff Zhang


Re: spark run shell On yarn

2016-07-28 Thread Jeff Zhang
One workaround is to disable the timeline service in yarn-site:

set yarn.timeline-service.enabled to false in yarn-site.xml
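For example, in yarn-site.xml on the client node:

  <property>
    <name>yarn.timeline-service.enabled</name>
    <value>false</value>
  </property>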

On Thu, Jul 28, 2016 at 5:31 PM, censj <ce...@lotuseed.com> wrote:

> 16/07/28 17:07:34 WARN shortcircuit.DomainSocketFactory: The short-circuit
> local reads feature cannot be used because libhadoop cannot be loaded.
> java.lang.NoClassDefFoundError:
> com/sun/jersey/api/client/config/ClientConfig
>   at
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
>   at
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
>   at
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
>   at
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
>   at org.apache.spark.SparkContext.(SparkContext.scala:500)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256)
>   at
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
>   ... 47 elided
> Caused by: java.lang.ClassNotFoundException:
> com.sun.jersey.api.client.config.ClientConfig
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   ... 60 more
> :14: error: not found: value spark
>import spark.implicits._
>   ^
> :14: error: not found: value spark
>import spark.sql
>   ^
> Welcome to
>
>
>
>
> hi:
> I use Spark 2.0, but when I run
> "/etc/spark-2.0.0-bin-hadoop2.6/bin/spark-shell --master yarn", this error
> appears.
>
> /etc/spark-2.0.0-bin-hadoop2.6/bin/spark-submit
> export YARN_CONF_DIR=/etc/hadoop/conf
> export HADOOP_CONF_DIR=/etc/hadoop/conf
> export SPARK_HOME=/etc/spark-2.0.0-bin-hadoop2.6
>
>
> What do I need to update?
>
>
>
>
>
> ===
> Name: cen sujun
> Mobile: 13067874572
> Mail: ce...@lotuseed.com
>
>


-- 
Best Regards

Jeff Zhang


Re: spark local dir to HDFS ?

2016-07-05 Thread Jeff Zhang
Any reason why you want to set this on hdfs ?

On Tue, Jul 5, 2016 at 3:47 PM, kali.tumm...@gmail.com <
kali.tumm...@gmail.com> wrote:

> Hi All,
>
> can I set spark.local.dir to HDFS location instead of /tmp folder ?
>
> I tried setting up temp folder to HDFS but it didn't worked can
> spark.local.dir write to HDFS ?
>
> .set("spark.local.dir","hdfs://namednode/spark_tmp/")
>
>
> 16/07/05 15:35:47 ERROR DiskBlockManager: Failed to create local dir in
> hdfs://namenode/spark_tmp/. Ignoring this directory.
> java.io.IOException: Failed to create a temp directory (under
> hdfs://namenode/spark_tmp/) after 10 attempts!
>
>
> Thanks
> Sri
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-local-dir-to-HDFS-tp27291.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -----
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang


Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Jeff Zhang
I think so. Any reason you want to deploy multiple thrift servers on one
machine?

On Fri, Jul 1, 2016 at 10:59 AM, Egor Pahomov <pahomov.e...@gmail.com>
wrote:

> Takeshi, of course I used different HIVE_SERVER2_THRIFT_PORT
> Jeff, thanks, I would try, but from your answer I'm getting the feeling,
> that I'm trying some very rare case?
>
> 2016-07-01 10:54 GMT-07:00 Jeff Zhang <zjf...@gmail.com>:
>
>> This is not a bug, because these 2 processes use the same SPARK_PID_DIR
>> which is /tmp by default.  Although you can resolve this by using
>> different SPARK_PID_DIR, I suspect you would still have other issues like
>> port conflict. I would suggest you to deploy one spark thrift server per
>> machine for now. If stick to deploy multiple spark thrift server on one
>> machine, then define different SPARK_CONF_DIR, SPARK_LOG_DIR and
>> SPARK_PID_DIR for your 2 instances of spark thrift server. Not sure if
>> there's other conflicts. but please try first.
>>
>>
>> On Fri, Jul 1, 2016 at 10:47 AM, Egor Pahomov <pahomov.e...@gmail.com>
>> wrote:
>>
>>> I get
>>>
>>> "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as
>>> process 28989.  Stop it first."
>>>
>>> Is it a bug?
>>>
>>> 2016-07-01 10:10 GMT-07:00 Jeff Zhang <zjf...@gmail.com>:
>>>
>>>> I don't think the one instance per machine is true.  As long as you
>>>> resolve the conflict issue such as port conflict, pid file, log file and
>>>> etc, you can run multiple instances of spark thrift server.
>>>>
>>>> On Fri, Jul 1, 2016 at 9:32 AM, Egor Pahomov <pahomov.e...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, I'm using Spark Thrift JDBC server and 2 limitations are really
>>>>> bother me -
>>>>>
>>>>> 1) One instance per machine
>>>>> 2) Yarn client only(not yarn cluster)
>>>>>
>>>>> Are there any architectural reasons for such limitations? About
>>>>> yarn-client I might understand in theory - master is the same process as a
>>>>> server, so it makes some sense, but it's really inconvenient - I need a 
>>>>> lot
>>>>> of memory on my driver machine. Reasons for one instance per machine I do
>>>>> not understand.
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>> *Sincerely yoursEgor Pakhomov*
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards
>>>>
>>>> Jeff Zhang
>>>>
>>>
>>>
>>>
>>> --
>>>
>>>
>>> *Sincerely yoursEgor Pakhomov*
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
>
>
> *Sincerely yoursEgor Pakhomov*
>



-- 
Best Regards

Jeff Zhang


Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Jeff Zhang
This is not a bug: these 2 processes use the same SPARK_PID_DIR, which is
/tmp by default.  Although you can resolve this by using a different
SPARK_PID_DIR, I suspect you would still have other issues like port
conflicts. I would suggest you deploy one Spark thrift server per machine for
now. If you stick to deploying multiple Spark thrift servers on one machine,
then define a different SPARK_CONF_DIR, SPARK_LOG_DIR and SPARK_PID_DIR for
your 2 instances of the Spark thrift server. Not sure if there are other
conflicts, but please try first.
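A rough sketch of what a second instance could look like (all paths and the
port are placeholders; HIVE_SERVER2_THRIFT_PORT picks a non-conflicting port):

  export SPARK_CONF_DIR=/etc/spark/conf-thrift2
  export SPARK_LOG_DIR=/var/log/spark-thrift2
  export SPARK_PID_DIR=/var/run/spark-thrift2
  export HIVE_SERVER2_THRIFT_PORT=10016
  $SPARK_HOME/sbin/start-thriftserver.sh --master yarn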


On Fri, Jul 1, 2016 at 10:47 AM, Egor Pahomov <pahomov.e...@gmail.com>
wrote:

> I get
>
> "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as
> process 28989.  Stop it first."
>
> Is it a bug?
>
> 2016-07-01 10:10 GMT-07:00 Jeff Zhang <zjf...@gmail.com>:
>
>> I don't think the one instance per machine is true.  As long as you
>> resolve the conflict issue such as port conflict, pid file, log file and
>> etc, you can run multiple instances of spark thrift server.
>>
>> On Fri, Jul 1, 2016 at 9:32 AM, Egor Pahomov <pahomov.e...@gmail.com>
>> wrote:
>>
>>> Hi, I'm using Spark Thrift JDBC server and 2 limitations are really
>>> bother me -
>>>
>>> 1) One instance per machine
>>> 2) Yarn client only(not yarn cluster)
>>>
>>> Are there any architectural reasons for such limitations? About
>>> yarn-client I might understand in theory - master is the same process as a
>>> server, so it makes some sense, but it's really inconvenient - I need a lot
>>> of memory on my driver machine. Reasons for one instance per machine I do
>>> not understand.
>>>
>>> --
>>>
>>>
>>> *Sincerely yoursEgor Pakhomov*
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
>
>
> *Sincerely yoursEgor Pakhomov*
>



-- 
Best Regards

Jeff Zhang


Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Jeff Zhang
I don't think the one-instance-per-machine limitation is true.  As long as you
resolve conflicts such as the port, PID file, log file and so on, you can run
multiple instances of the Spark thrift server.

On Fri, Jul 1, 2016 at 9:32 AM, Egor Pahomov <pahomov.e...@gmail.com> wrote:

> Hi, I'm using Spark Thrift JDBC server and 2 limitations are really bother
> me -
>
> 1) One instance per machine
> 2) Yarn client only(not yarn cluster)
>
> Are there any architectural reasons for such limitations? About
> yarn-client I might understand in theory - master is the same process as a
> server, so it makes some sense, but it's really inconvenient - I need a lot
> of memory on my driver machine. Reasons for one instance per machine I do
> not understand.
>
> --
>
>
> *Sincerely yoursEgor Pakhomov*
>



-- 
Best Regards

Jeff Zhang


Re: Remote RPC client disassociated

2016-06-30 Thread Jeff Zhang
$$anonfun$run$3.apply(PythonRDD.scala:280)
>
>   at
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
>
>   at
> org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
>
> 16/06/30 10:44:34 ERROR util.SparkUncaughtExceptionHandler: Uncaught
> exception in thread Thread[stdout writer for python,5,main]
>
> java.lang.AbstractMethodError:
> pyspark_cassandra.DeferringRowReader.read(Lcom/datastax/driver/core/Row;Lcom/datastax/spark/connector/CassandraRowMetadata;)Ljava/lang/Object;
>
>   at
> com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$17.apply(CassandraTableScanRDD.scala:315)
>
>   at
> com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$17.apply(CassandraTableScanRDD.scala:315)
>
>   at
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>
>   at
> scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>
>   at
> com.datastax.spark.connector.util.CountingIterator.next(CountingIterator.scala:16)
>
>   at
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>
>   at
> scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:914)
>
>   at
> scala.collection.Iterator$GroupedIterator.go(Iterator.scala:929)
>
>       at
> scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:968)
>
>   at
> scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
>
>   at
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
>   at
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
>
>   at
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>
>   at
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
>
>   at
> org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
>
>   at
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
>
>at
> org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
> BR
>
>
>
> Joaquin
> This email is confidential and may be subject to privilege. If you are not
> the intended recipient, please do not copy or disclose its content but
> contact the sender immediately upon receipt.
>



-- 
Best Regards

Jeff Zhang


Re: Call Scala API from PySpark

2016-06-30 Thread Jeff Zhang
Hi Pedro,

Your use case is interesting.  I think launching the Java gateway is the same
as for the native SparkContext; the only difference is creating your custom
SparkContext instead of the native one. You might also need to wrap it using
Java.

https://github.com/apache/spark/blob/v1.6.2/python/pyspark/context.py#L172
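A hedged sketch of the Python side (everything here is hypothetical: it
assumes the Scala wrapper can be constructed from the underlying
SparkContext and that customTextFile returns a JavaRDD[String], e.g. via
JavaRDD.fromRDD on the Scala side):

  from pyspark import SparkContext
  from pyspark.rdd import RDD
  from pyspark.serializers import UTF8Deserializer

  sc = SparkContext.getOrCreate()

  # hypothetical JVM-side wrapper from this thread
  jctx = sc._jvm.com.company.mylibrary.CustomContext(sc._jsc.sc())

  # assumed to return a JavaRDD[String]
  jrdd = jctx.customTextFile("path")

  # wrap it back into a Python RDD of strings, the way SparkContext.textFile does
  rdd = RDD(jrdd, sc, UTF8Deserializer())
  print(rdd.take(5))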



On Thu, Jun 30, 2016 at 9:53 AM, Pedro Rodriguez <ski.rodrig...@gmail.com>
wrote:

> Hi All,
>
> I have written a Scala package which essentially wraps the SparkContext
> around a custom class that adds some functionality specific to our internal
> use case. I am trying to figure out the best way to call this from PySpark.
>
> I would like to do this similarly to how Spark itself calls the JVM
> SparkContext as in:
> https://github.com/apache/spark/blob/v1.6.2/python/pyspark/context.py
>
> My goal would be something like this:
>
> Scala Code (this is done):
> >>> import com.company.mylibrary.CustomContext
> >>> val myContext = CustomContext(sc)
> >>> val rdd: RDD[String] = myContext.customTextFile("path")
>
> Python Code (I want to be able to do this):
> >>> from company.mylibrary import CustomContext
> >>> myContext = CustomContext(sc)
> >>> rdd = myContext.customTextFile("path")
>
> At the end of each code, I should be working with an ordinary RDD[String].
>
> I am trying to access my Scala class through sc._jvm as below, but not
> having any luck so far.
>
> My attempts:
> >>> a = sc._jvm.com.company.mylibrary.CustomContext
> >>> dir(a)
> ['']
>
> Example of what I want::
> >>> a = sc._jvm.PythonRDD
> >>> dir(a)
> ['anonfun$6', 'anonfun$8', 'collectAndServe',
> 'doubleRDDToDoubleRDDFunctions', 'getWorkerBroadcasts', 'hadoopFile',
> 'hadoopRDD', 'newAPIHadoopFile', 'newAPIHadoopRDD',
> 'numericRDDToDoubleRDDFunctions', 'rddToAsyncRDDActions',
> 'rddToOrderedRDDFunctions', 'rddToPairRDDFunctions',
> 'rddToPairRDDFunctions$default$4', 'rddToSequenceFileRDDFunctions',
> 'readBroadcastFromFile', 'readRDDFromFile', 'runJob',
> 'saveAsHadoopDataset', 'saveAsHadoopFile', 'saveAsNewAPIHadoopFile',
> 'saveAsSequenceFile', 'sequenceFile', 'serveIterator', 'valueOfPair',
> 'writeIteratorToStream', 'writeUTF']
>
> The next thing I would run into is converting the JVM RDD[String] back to
> a Python RDD, what is the easiest way to do this?
>
> Overall, is this a good approach to calling the same API in Scala and
> Python?
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


-- 
Best Regards

Jeff Zhang


Re: Error Invoking Spark on Yarn on using Spark Submit

2016-06-24 Thread Jeff Zhang
You might have multiple (conflicting) Java servlet API jars on your classpath;
"AmIpFilter is not a javax.servlet.Filter" usually means the javax.servlet.Filter
interface was loaded from a different servlet-api jar/classloader than the one
AmIpFilter was compiled against.

On Fri, Jun 24, 2016 at 3:31 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> can you please check the yarn log files to see what they say (both the
> nodemanager and the resourcemanager)?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 24 June 2016 at 08:14, puneet kumar <puneetkumar.2...@gmail.com> wrote:
>
>>
>>
>> I am getting below error thrown when I submit Spark Job using Spark
>> Submit on Yarn. Need a quick help on what's going wrong here.
>>
>> 16/06/24 01:09:25 WARN AbstractLifeCycle: FAILED 
>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter-791eb5d5: 
>> java.lang.IllegalStateException: class 
>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
>> javax.servlet.Filter
>> java.lang.IllegalStateException: class 
>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
>> javax.servlet.Filter
>>  at 
>> org.spark-project.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
>>  at 
>> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>>  at 
>> org.spark-project.jetty.servlet.ServletHandler.initialize(ServletHandler.java:768)
>>  at 
>> org.spark-project.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:265)
>>  at 
>> org.spark-project.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:717)
>>  at 
>> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>>  at 
>> org.spark-project.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:95)
>>  at 
>> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>>  at 
>> org.spark-project.jetty.server.handler.HandlerCollection.doStart(HandlerCollection.java:229)
>>  at 
>> org.spark-project.jetty.server.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:172)
>>  at 
>> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>>  at 
>> org.spark-project.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:95)
>>  at org.spark-project.jetty.server.Server.doStart(Server.java:282)
>>  at 
>> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>>  at 
>> org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:252)
>>  at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
>>  at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
>>  at 
>> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1988)
>>  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>>  at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1979)
>>  at 
>> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
>>  at org.apache.spark.ui.WebUI.bind(WebUI.scala:137)
>>  at 
>> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
>>  at 
>> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
>>  at scala.Option.foreach(Option.scala:236)
>>  at org.apache.spark.SparkContext.(SparkContext.scala:481)
>>  at 
>> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59)
>>
>>
>>
>


-- 
Best Regards

Jeff Zhang


Re: Building Spark 2.X in Intellij

2016-06-23 Thread Jeff Zhang
You need to add
spark/external/flume-sink/target/scala-2.11/src_managed/main/compiled_avro
to the build path; it holds the Avro-generated classes (SparkFlumeProtocol,
EventBatch, SparkSinkEvent) that the flume-sink module compiles against. If I
remember correctly this is the only thing you need to do manually (the directory
is produced by a command-line Maven build, so run one first if it does not exist yet).



On Thu, Jun 23, 2016 at 2:30 PM, Stephen Boesch <java...@gmail.com> wrote:

> Hi Jeff,
>   I'd like to understand what may be different. I have rebuilt and
> reimported many times.  Just now I blew away the .idea/* and *.iml to start
> from scratch.  I just opened the $SPARK_HOME directory from intellij File |
> Open  .  After it finished the initial import I tried to run one of the
> Examples - and it fails in the build:
>
> Here are the errors I see:
>
> Error:(45, 66) not found: type SparkFlumeProtocol
>   val transactionTimeout: Int, val backOffInterval: Int) extends
> SparkFlumeProtocol with Logging {
>  ^
> Error:(70, 39) not found: type EventBatch
>   override def getEventBatch(n: Int): EventBatch = {
>   ^
> Error:(85, 13) not found: type EventBatch
> new EventBatch("Spark sink has been stopped!", "",
> java.util.Collections.emptyList())
> ^
>
> /git/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/TransactionProcessor.scala
> Error:(80, 22) not found: type EventBatch
>   def getEventBatch: EventBatch = {
>  ^
> Error:(48, 37) not found: type EventBatch
>   @volatile private var eventBatch: EventBatch = new EventBatch("Unknown
> Error", "",
> ^
> Error:(48, 54) not found: type EventBatch
>   @volatile private var eventBatch: EventBatch = new EventBatch("Unknown
> Error", "",
>  ^
> Error:(115, 41) not found: type SparkSinkEvent
> val events = new util.ArrayList[SparkSinkEvent](maxBatchSize)
> ^
> Error:(146, 28) not found: type EventBatch
>   eventBatch = new EventBatch("", seqNum, events)
>^
>
> /git/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkSinkUtils.scala
> Error:(25, 27) not found: type EventBatch
>   def isErrorBatch(batch: EventBatch): Boolean = {
>   ^
>
> /git/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkSink.scala
> Error:(86, 51) not found: type SparkFlumeProtocol
> val responder = new SpecificResponder(classOf[SparkFlumeProtocol],
> handler.get)
>   ^
>
>
> Note: this is just the first batch of errors.
>
>
>
>
> 2016-06-22 20:50 GMT-07:00 Jeff Zhang <zjf...@gmail.com>:
>
>> It works well for me. You can try reimporting it into IntelliJ.
>>
>> On Thu, Jun 23, 2016 at 10:25 AM, Stephen Boesch <java...@gmail.com>
>> wrote:
>>
>>>
>>> Building inside intellij is an ever moving target. Anyone have the
>>> magical procedures to get it going for 2.X?
>>>
>>> There are numerous library references that - although included in the
>>> pom.xml build - are for some reason not found when processed within
>>> Intellij.
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>


-- 
Best Regards

Jeff Zhang


Re: Building Spark 2.X in Intellij

2016-06-22 Thread Jeff Zhang
It works well for me. You can try reimporting it into IntelliJ.

On Thu, Jun 23, 2016 at 10:25 AM, Stephen Boesch <java...@gmail.com> wrote:

>
> Building inside intellij is an ever moving target. Anyone have the magical
> procedures to get it going for 2.X?
>
> There are numerous library references that - although included in the
> pom.xml build - are for some reason not found when processed within
> Intellij.
>



-- 
Best Regards

Jeff Zhang


Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread Jeff Zhang
Make sure you built Spark with -Pyarn, and check whether the class
org.apache.spark.deploy.yarn.ExecutorLauncher is present in your Spark assembly jar.
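
One quick way to check is "jar tf <your assembly jar> | grep ExecutorLauncher";
the same check from Python (the jar location below is only an example):

import glob
import zipfile

# Adjust the glob to wherever your assembly jar actually lives
jar = glob.glob("lib/spark-assembly-*.jar")[0]
entries = zipfile.ZipFile(jar).namelist()
print([e for e in entries if "deploy/yarn/ExecutorLauncher" in e])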


On Wed, Jun 22, 2016 at 2:04 PM, Yash Sharma <yash...@gmail.com> wrote:

> How about supplying the jar directly in spark submit -
>
> ./bin/spark-submit \
>> --class org.apache.spark.examples.SparkPi \
>> --master yarn-client \
>> --driver-memory 512m \
>> --num-executors 2 \
>> --executor-memory 512m \
>> --executor-cores 2 \
>> /user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar
>
>
> On Wed, Jun 22, 2016 at 3:59 PM, 另一片天 <958943...@qq.com> wrote:
>
>> i  config this  para  at spark-defaults.conf
>> spark.yarn.jar
>> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar
>>
>> then ./bin/spark-submit --class org.apache.spark.examples.SparkPi
>> --master yarn-client --driver-memory 512m --num-executors 2
>> --executor-memory 512m --executor-cores 210:
>>
>>
>>
>>- Error: Could not find or load main class
>>org.apache.spark.deploy.yarn.ExecutorLauncher
>>
>> but if I don't configure that parameter, there is no error. Why??? Is that
>> parameter only meant to avoid uploading the resource file (the jar package)?
>>
>
>


-- 
Best Regards

Jeff Zhang


Re: Does saveAsHadoopFile depend on master?

2016-06-21 Thread Jeff Zhang
Please check the driver and executor logs; there should be log lines showing
where the data was written.
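
One thing worth ruling out: in yarn-cluster mode the driver runs on a cluster
node, so an output path that is relative, or that relies on the client machine's
default filesystem settings, may resolve somewhere other than the directory you
are checking. A tiny PySpark sketch with a made-up path (the same idea applies
to saveAsHadoopFile), using a fully qualified URI so the target is unambiguous:

# sc is the SparkContext from the shell; namenode host/port and path are illustrative
rdd = sc.parallelize(["a", "b", "c"])
rdd.saveAsTextFile("hdfs://namenode:8020/user/pierre/output/run1")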



On Wed, Jun 22, 2016 at 2:03 AM, Pierre Villard <pierre.villard...@gmail.com
> wrote:

> Hi,
>
> I have a Spark job writing files to HDFS using .saveAsHadoopFile method.
>
> If I run my job in local/client mode, it works as expected and I get all
> my files written in HDFS. However if I change to yarn/cluster mode, I don't
> see any error logs (the job is successful) and there is no files written to
> HDFS.
>
> Is there any reason for this behavior? Any thoughts on how to track down
> what is happening here?
>
> Thanks!
>
> Pierre.
>



-- 
Best Regards

Jeff Zhang


Re: Build Spark 2.0 succeeded but could not run it on YARN

2016-06-20 Thread Jeff Zhang
Could you check the YARN app logs for details? Run the command "yarn logs
-applicationId <application_id>" to get the YARN log.

On Tue, Jun 21, 2016 at 9:18 AM, wgtmac <ust...@gmail.com> wrote:

> I ran into problems in building Spark 2.0. The build process actually
> succeeded but when I uploaded to cluster and launched the Spark shell on
> YARN, it reported following exceptions again and again:
>
> 16/06/17 03:32:00 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint:
> Container marked as failed: container_e437_1464601161543_1582846_01_13
> on host: hadoopworker575-sjc1.. Exit status: 1.
> Diagnostics:
> Exception from container-launch.
> Container id: container_e437_1464601161543_1582846_01_13
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> at org.apache.hadoop.util.Shell.run(Shell.java:455)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
> at
>
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
> at
>
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> at
>
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> Container exited with a non-zero exit code 1
>
> =
> Build command:
>
> export JAVA_HOME=   // tried both java7 and java8
> ./dev/change-scala-version.sh 2.11   // tried both 2.10 and 2.11
> ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive
> -Phive-thriftserver -DskipTests clean package
>
> The 2.0.0-preview version downloaded from Spark website works well so it is
> not the problem of my cluster. Also I can make it to build Spark 1.5 and
> 1.6
> and run them on the cluster. But in Spark 2.0, I failed both 2.0.0-preview
> tag and 2.0.0-SNAPSHOT. Anyone has any idea? Thanks!
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Build-Spark-2-0-succeeded-but-could-not-run-it-on-YARN-tp27199.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -----
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang


Re: How to deal with tasks running too long?

2016-06-16 Thread Jeff Zhang
This may be due to data skew: if a few partitions (or a few hot keys) hold most
of the data, their tasks run far longer than the rest. The per-task input size
and shuffle read columns on the stage page of the Spark UI will show whether
that is the case.
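
If it does turn out to be skew on a few hot keys, one common mitigation is to
salt the keys so that a hot key's records spread over many tasks. A minimal,
illustrative PySpark sketch (key names and sizes are made up; it assumes an
associative aggregation and `sc` from the shell):

import random

N_SALTS = 32

# Toy pair RDD with one extremely hot key
rdd = sc.parallelize([("hot", 1)] * 100000 + [("rare", 1)] * 10)

# 1) Append a random salt so the hot key spreads over up to N_SALTS reduce tasks
salted = rdd.map(lambda kv: ((kv[0], random.randint(0, N_SALTS - 1)), kv[1]))
# 2) Partial aggregation per (key, salt)
partial = salted.reduceByKey(lambda a, b: a + b)
# 3) Strip the salt and combine the partial results per original key
totals = partial.map(lambda kv: (kv[0][0], kv[1])).reduceByKey(lambda a, b: a + b)

print(totals.collect())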

On Thu, Jun 16, 2016 at 12:45 PM, Utkarsh Sengar <utkarsh2...@gmail.com>
wrote:

> This SO question was asked about 1yr ago.
>
> http://stackoverflow.com/questions/31799755/how-to-deal-with-tasks-running-too-long-comparing-to-others-in-job-in-yarn-cli
>
> I answered this question with a suggestion to try speculation but it
> doesn't quite do what the OP expects. I have been running into this issue
> more these days. Out of 5000 tasks, 4950 completes in 5mins but the last 50
> never really completes, have tried waiting for 4hrs. This can be a memory
> issue or maybe the way spark's fine grained mode works with mesos, I am
> trying to enable jmxsink to get a heap dump.
>
> But in the mean time, is there a better fix for this? (in any version of
> spark, I am using 1.5.1 but can upgrade). It would be great if the last 50
> tasks in my example can be killed (timed out) and the stage completes
> successfully.
>
> --
> Thanks,
> -Utkarsh
>



-- 
Best Regards

Jeff Zhang


Re: java server error - spark

2016-06-15 Thread Jeff Zhang
Then the only solution is to increase your driver memory (still limited by your
machine's physical memory) via "--driver-memory". In local mode the executors
run inside the driver JVM, so the driver heap is the one that matters.
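
If the SparkContext is created from a plain Python script (rather than
bin/pyspark or spark-submit, where you would just pass --driver-memory on the
command line), the extra memory has to be requested before the JVM is launched;
setting spark.driver.memory on an already-running context has no effect. A rough
sketch of that workaround (the 8g figure is illustrative):

import os

# Must be set before pyspark starts the JVM; the trailing "pyspark-shell" is required
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 8g pyspark-shell"

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[4]").setAppName("bigger-driver-heap")
sc = SparkContext(conf=conf)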

On Thu, Jun 16, 2016 at 9:53 AM, spR <data.smar...@gmail.com> wrote:

> Hey,
>
> But I just have one machine. I am running everything on my laptop. Won't I
> be able to do this processing in local mode then?
>
> Regards,
> Tejaswini
>
> On Wed, Jun 15, 2016 at 6:32 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> You are using local mode, --executor-memory  won't take effect for local
>> mode, please use other cluster mode.
>>
>> On Thu, Jun 16, 2016 at 9:32 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> Specify --executor-memory in your spark-submit command.
>>>
>>>
>>>
>>> On Thu, Jun 16, 2016 at 9:01 AM, spR <data.smar...@gmail.com> wrote:
>>>
>>>> Thank you. Can you pls tell How to increase the executor memory?
>>>>
>>>>
>>>>
>>>> On Wed, Jun 15, 2016 at 5:59 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>
>>>>> >>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>
>>>>>
>>>>> It is OOM on the executor.  Please try to increase executor memory.
>>>>> "--executor-memory"
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jun 16, 2016 at 8:54 AM, spR <data.smar...@gmail.com> wrote:
>>>>>
>>>>>> Hey,
>>>>>>
>>>>>> error trace -
>>>>>>
>>>>>> hey,
>>>>>>
>>>>>>
>>>>>> error trace -
>>>>>>
>>>>>>
>>>>>> ---Py4JJavaError
>>>>>>  Traceback (most recent call 
>>>>>> last) in ()> 1 temp.take(2)
>>>>>>
>>>>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/dataframe.pyc
>>>>>>  in take(self, num)304 with SCCallSiteSync(self._sc) as css: 
>>>>>>305 port = 
>>>>>> self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(-->
>>>>>>  306 self._jdf, num)307 return 
>>>>>> list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
>>>>>> 308
>>>>>>
>>>>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
>>>>>>  in __call__(self, *args)811 answer = 
>>>>>> self.gateway_client.send_command(command)812 return_value = 
>>>>>> get_return_value(--> 813 answer, self.gateway_client, 
>>>>>> self.target_id, self.name)814
>>>>>> 815 for temp_arg in temp_args:
>>>>>>
>>>>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/utils.pyc
>>>>>>  in deco(*a, **kw) 43 def deco(*a, **kw): 44 
>>>>>> try:---> 45 return f(*a, **kw) 46 except 
>>>>>> py4j.protocol.Py4JJavaError as e: 47 s = 
>>>>>> e.java_exception.toString()
>>>>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py
>>>>>>  in get_return_value(answer, gateway_client, target_id, name)306 
>>>>>> raise Py4JJavaError(307 "An error 
>>>>>> occurred while calling {0}{1}{2}.\n".--> 308 
>>>>>> format(target_id, ".", name), value)309 else:
>>>>>> 310 raise Py4JError(
>>>>>> Py4JJavaError: An error occurred while calling 
>>>>>> z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
>>>>>> : org.apache.spark.SparkException: Job aborted due to stage failure: 
>>>>>> Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 
>>>>>> in stage 3.0 (TID 76, localhost): java.lang.OutOfMemoryError: GC 
>>>>>> overhead limit exceeded
>>>>>>  at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2205)
>>>>>>

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
You are using local mode, so --executor-memory won't take effect; the executor
runs inside the driver JVM there. Either use a real cluster mode or increase the
driver memory instead.

On Thu, Jun 16, 2016 at 9:32 AM, Jeff Zhang <zjf...@gmail.com> wrote:

> Specify --executor-memory in your spark-submit command.
>
>
>
> On Thu, Jun 16, 2016 at 9:01 AM, spR <data.smar...@gmail.com> wrote:
>
>> Thank you. Can you pls tell How to increase the executor memory?
>>
>>
>>
>> On Wed, Jun 15, 2016 at 5:59 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> >>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>
>>>
>>> It is OOM on the executor.  Please try to increase executor memory.
>>> "--executor-memory"
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jun 16, 2016 at 8:54 AM, spR <data.smar...@gmail.com> wrote:
>>>
>>>> Hey,
>>>>
>>>> error trace -
>>>>
>>>> hey,
>>>>
>>>>
>>>> error trace -
>>>>
>>>>
>>>> ---Py4JJavaError
>>>>  Traceback (most recent call 
>>>> last) in ()> 1 temp.take(2)
>>>>
>>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/dataframe.pyc
>>>>  in take(self, num)304 with SCCallSiteSync(self._sc) as css:   
>>>>  305 port = 
>>>> self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(-->
>>>>  306 self._jdf, num)307 return 
>>>> list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
>>>> 308
>>>>
>>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
>>>>  in __call__(self, *args)811 answer = 
>>>> self.gateway_client.send_command(command)812 return_value = 
>>>> get_return_value(--> 813 answer, self.gateway_client, 
>>>> self.target_id, self.name)814
>>>> 815 for temp_arg in temp_args:
>>>>
>>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/utils.pyc
>>>>  in deco(*a, **kw) 43 def deco(*a, **kw): 44 try:---> 
>>>> 45 return f(*a, **kw) 46 except 
>>>> py4j.protocol.Py4JJavaError as e: 47 s = 
>>>> e.java_exception.toString()
>>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py
>>>>  in get_return_value(answer, gateway_client, target_id, name)306   
>>>>   raise Py4JJavaError(307 "An error 
>>>> occurred while calling {0}{1}{2}.\n".--> 308 
>>>> format(target_id, ".", name), value)309 else:
>>>> 310 raise Py4JError(
>>>> Py4JJavaError: An error occurred while calling 
>>>> z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
>>>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
>>>> 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
>>>> 3.0 (TID 76, localhost): java.lang.OutOfMemoryError: GC overhead limit 
>>>> exceeded
>>>>at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2205)
>>>>at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1984)
>>>>at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3403)
>>>>at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:470)
>>>>at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3105)
>>>>at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2336)
>>>>at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2729)
>>>>at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2549)
>>>>at 
>>>> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
>>>>at 
>>>> com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1962)
>>>>at 
>>>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:363)
>>>>at 
>>>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339)
>>>>at org.apache.spark.rdd.RDD.computeOrReadCh

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
Specify --executor-memory in your spark-submit command.



On Thu, Jun 16, 2016 at 9:01 AM, spR <data.smar...@gmail.com> wrote:

> Thank you. Can you pls tell How to increase the executor memory?
>
>
>
> On Wed, Jun 15, 2016 at 5:59 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> >>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>
>>
>> It is OOM on the executor.  Please try to increase executor memory.
>> "--executor-memory"
>>
>>
>>
>>
>>
>> On Thu, Jun 16, 2016 at 8:54 AM, spR <data.smar...@gmail.com> wrote:
>>
>>> Hey,
>>>
>>> error trace -
>>>
>>> hey,
>>>
>>>
>>> error trace -
>>>
>>>
>>> ---Py4JJavaError
>>>  Traceback (most recent call 
>>> last) in ()> 1 temp.take(2)
>>>
>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/dataframe.pyc
>>>  in take(self, num)304 with SCCallSiteSync(self._sc) as css:
>>> 305 port = 
>>> self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(-->
>>>  306 self._jdf, num)307 return 
>>> list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
>>> 308
>>>
>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
>>>  in __call__(self, *args)811 answer = 
>>> self.gateway_client.send_command(command)812 return_value = 
>>> get_return_value(--> 813 answer, self.gateway_client, 
>>> self.target_id, self.name)814
>>> 815 for temp_arg in temp_args:
>>>
>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/utils.pyc
>>>  in deco(*a, **kw) 43 def deco(*a, **kw): 44 try:---> 
>>> 45 return f(*a, **kw) 46 except 
>>> py4j.protocol.Py4JJavaError as e: 47 s = 
>>> e.java_exception.toString()
>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py
>>>  in get_return_value(answer, gateway_client, target_id, name)306
>>>  raise Py4JJavaError(307 "An error occurred 
>>> while calling {0}{1}{2}.\n".--> 308 format(target_id, 
>>> ".", name), value)309 else:
>>> 310 raise Py4JError(
>>> Py4JJavaError: An error occurred while calling 
>>> z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
>>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
>>> in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
>>> 3.0 (TID 76, localhost): java.lang.OutOfMemoryError: GC overhead limit 
>>> exceeded
>>> at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2205)
>>> at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1984)
>>> at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3403)
>>> at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:470)
>>> at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3105)
>>> at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2336)
>>> at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2729)
>>> at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2549)
>>> at 
>>> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
>>> at 
>>> com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1962)
>>> at 
>>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:363)
>>> at 
>>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>> at 
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>> at 
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at org.apache.spa

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
rg.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
>   at 
> org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply$mcI$sp(python.scala:126)
>   at 
> org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply(python.scala:124)
>   at 
> org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply(python.scala:124)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.execution.EvaluatePython$.takeAndServe(python.scala:124)
>   at 
> org.apache.spark.sql.execution.EvaluatePython.takeAndServe(python.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:209)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2205)
>   at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1984)
>   at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3403)
>   at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:470)
>   at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3105)
>   at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2336)
>   at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2729)
>   at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2549)
>   at 
> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
>   at 
> com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1962)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:363)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$Tas

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
Could you paste the full stacktrace ?

On Thu, Jun 16, 2016 at 7:24 AM, spR <data.smar...@gmail.com> wrote:

> Hi,
> I am getting this error while executing a query using sqlcontext.sql
>
> The table has around 2.5 gb of data to be scanned.
>
> First I get out of memory exception. But I have 16 gb of ram
>
> Then my notebook dies and I get below error
>
> Py4JNetworkError: An error occurred while trying to connect to the Java server
>
>
> Thank You
>



-- 
Best Regards

Jeff Zhang


Re: Limit pyspark.daemon threads

2016-06-15 Thread Jeff Zhang
quite a bit higher, so make sure you at least have a
>>>>> recent version; see SPARK-5395.)
>>>>>
>>>>> Each pyspark daemon tries to stay below the configured memory limit
>>>>> during aggregation (which is separate from the JVM heap as you note). 
>>>>> Since
>>>>> the number of daemons can be high and the memory limit is per daemon (each
>>>>> daemon is actually a process and not a thread and therefore has its own
>>>>> memory it tracks against the configured per-worker limit), I found memory
>>>>> depletion to be the main source of pyspark problems on larger data sets.
>>>>> Also, as Sea already noted the memory limit is not firm and individual
>>>>> daemons can grow larger.
>>>>>
>>>>> With that said, a run queue of 25 on a 16 core machine does not sound
>>>>> great but also not awful enough to knock it offline. I suspect something
>>>>> else may be going on. If you want to limit the amount of work running
>>>>> concurrently, try reducing spark.executor.cores (under normal 
>>>>> circumstances
>>>>> this would leave parts of your resources underutilized).
>>>>>
>>>>> Hope this helps!
>>>>> -Sven
>>>>>
>>>>>
>>>>> On Fri, Mar 25, 2016 at 10:41 AM, Carlile, Ken <
>>>>> carli...@janelia.hhmi.org> wrote:
>>>>>
>>>>>> Further data on this.
>>>>>> I’m watching another job right now where there are 16 pyspark.daemon
>>>>>> threads, all of which are trying to get a full core (remember, this is a 
>>>>>> 16
>>>>>> core machine). Unfortunately , the java process actually running the 
>>>>>> spark
>>>>>> worker is trying to take several cores of its own, driving the load up. 
>>>>>> I’m
>>>>>> hoping someone has seen something like this.
>>>>>>
>>>>>> —Ken
>>>>>>
>>>>>> On Mar 21, 2016, at 3:07 PM, Carlile, Ken <carli...@janelia.hhmi.org>
>>>>>> wrote:
>>>>>>
>>>>>> No further input on this? I discovered today that the pyspark.daemon
>>>>>> threadcount was actually 48, which makes a little more sense (at least 
>>>>>> it’s
>>>>>> a multiple of 16), and it seems to be happening at reduce and collect
>>>>>> portions of the code.
>>>>>>
>>>>>> —Ken
>>>>>>
>>>>>> On Mar 17, 2016, at 10:51 AM, Carlile, Ken <carli...@janelia.hhmi.org>
>>>>>> wrote:
>>>>>>
>>>>>> Thanks! I found that part just after I sent the email… whoops. I’m
>>>>>> guessing that’s not an issue for my users, since it’s been set that way 
>>>>>> for
>>>>>> a couple of years now.
>>>>>>
>>>>>> The thread count is definitely an issue, though, since if enough
>>>>>> nodes go down, they can’t schedule their spark clusters.
>>>>>>
>>>>>> —Ken
>>>>>>
>>>>>> On Mar 17, 2016, at 10:50 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>
>>>>>> I took a look at docs/configuration.md
>>>>>> Though I didn't find answer for your first question, I think the
>>>>>> following pertains to your second question:
>>>>>>
>>>>>> spark.python.worker.memory  (default: 512m)
>>>>>> Amount of memory to use per python worker process during aggregation,
>>>>>> in the same format as JVM memory strings (e.g. 512m, 2g). If the memory
>>>>>> used during aggregation goes above this amount, it will spill the data
>>>>>> into disks.
>>>>>>
>>>>>> On Thu, Mar 17, 2016 at 7:43 AM, Carlile, Ken <
>>>>>> carli...@janelia.hhmi.org> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> We have an HPC cluster that we run Spark jobs on using standalone
>>>>>>> mode and a number of scripts I’ve built up to dynamically schedule and
>>>>>>> start spark clusters within the Grid Engine framework. Nodes in the 
>>>>>>> cluster
>>>>>>> have 16 cores and 128GB of RAM.
>>>>>>>
>>>>>>> My users use pyspark heavily. We’ve been having a number of problems
>>>>>>> with nodes going offline with extraordinarily high load. I was able to 
>>>>>>> look
>>>>>>> at one of those nodes today before it went truly sideways, and I 
>>>>>>> discovered
>>>>>>> that the user was running 50 pyspark.daemon threads (remember, this is 
>>>>>>> a 16
>>>>>>> core box), and the load was somewhere around 25 or so, with all CPUs 
>>>>>>> maxed
>>>>>>> out at 100%.
>>>>>>>
>>>>>>> So while the spark worker is aware it’s only got 16 cores and
>>>>>>> behaves accordingly, pyspark seems to be happy to overrun everything 
>>>>>>> like
>>>>>>> crazy. Is there a global parameter I can use to limit pyspark threads 
>>>>>>> to a
>>>>>>> sane number, say 15 or 16? It would also be interesting to set a memory
>>>>>>> limit, which leads to another question.
>>>>>>>
>>>>>>> How is memory managed when pyspark is used? I have the spark worker
>>>>>>> memory set to 90GB, and there is 8GB of system overhead (GPFS caching), 
>>>>>>> so
>>>>>>> if pyspark operates outside of the JVM memory pool, that leaves it at 
>>>>>>> most
>>>>>>> 30GB to play with, assuming there is no overhead outside the JVM’s 90GB
>>>>>>> heap (ha ha.)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ken Carlile
>>>>>>> Sr. Unix Engineer
>>>>>>> HHMI/Janelia Research Campus
>>>>>>> 571-209-4363
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -
>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
>>>>>> additional commands, e-mail: user-h...@spark.apache.org
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> www.skrasser.com <http://www.skrasser.com/?utm_source=sig>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> www.skrasser.com <http://www.skrasser.com/?utm_source=sig>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> www.skrasser.com <http://www.skrasser.com/?utm_source=sig>
>>>
>>
>>
>


-- 
Best Regards

Jeff Zhang


Re: sqlcontext - not able to connect to database

2016-06-14 Thread Jeff Zhang
The JDBC driver jar is not on the classpath; add it with --jars when you launch
pyspark/spark-submit (and, if the driver process in client mode still cannot
load it, also pass --driver-class-path with the same jar).
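
For example, launch with the connector jar on the classpath, e.g.
"pyspark --jars /path/to/mysql-connector-java-5.1.38-bin.jar
--driver-class-path /path/to/mysql-connector-java-5.1.38-bin.jar" (path and
version are illustrative), and then read the table. The partitioning options are
optional and assume a numeric id column; they just keep a large table from being
pulled through a single connection:

# sqlContext as provided by the pyspark shell; credentials/bounds are examples only
df = (sqlContext.read.format("jdbc")
      .options(
          url="jdbc:mysql://localhost:3306/my_db",
          driver="com.mysql.jdbc.Driver",
          dbtable="data1",
          user="root",
          password="123",
          partitionColumn="id",
          lowerBound="1",
          upperBound="1000000",
          numPartitions="8")
      .load())

df.printSchema()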

On Wed, Jun 15, 2016 at 12:45 PM, Tejaswini Buche <
tejaswini.buche0...@gmail.com> wrote:

> hi,
>
> I am trying to connect to a mysql database on my machine.
> But, I am getting some error
>
> dataframe_mysql = sqlContext.read.format("jdbc").options(
> url="jdbc:mysql://localhost:3306/my_db",
> driver = "com.mysql.jdbc.Driver",
> dbtable = "data1",
> user="123").load()
>
>
> below is the full trace -
>
> ---Py4JJavaError
>  Traceback (most recent call 
> last) in ()  4 dbtable = 
> "data1",  5 user="root",> 6 password="123").load()  7
> /Users/tejaV/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/readwriter.pyc
>  in load(self, path, format, schema, **options)137 return 
> self._df(self._jreader.load(path))138 else:--> 139 
> return self._df(self._jreader.load())140 141 @since(1.4)
> /Users/tejaV/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)811 answer = 
> self.gateway_client.send_command(command)812 return_value = 
> get_return_value(--> 813 answer, self.gateway_client, 
> self.target_id, self.name)814 815 for temp_arg in temp_args:
> /Users/tejaV/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/utils.pyc
>  in deco(*a, **kw) 43 def deco(*a, **kw): 44 try:---> 45  
>return f(*a, **kw) 46 except 
> py4j.protocol.Py4JJavaError as e: 47 s = 
> e.java_exception.toString()
> /Users/tejaV/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)306  
>raise Py4JJavaError(307 "An error occurred 
> while calling {0}{1}{2}.\n".--> 308 format(target_id, 
> ".", name), value)309 else:310 raise 
> Py4JError(
> Py4JJavaError: An error occurred while calling o98.load.
> : java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:45)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:45)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createConnectionFactory(JdbcUtils.scala:45)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:91)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:209)
>   at java.lang.Thread.run(Thread.java:745)
>
>
>


-- 
Best Regards

Jeff Zhang


Re: Spark corrupts text lines

2016-06-14 Thread Jeff Zhang
Can you read this file using a plain MR job?
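
In the meantime, a small way to narrow it down on the Spark side is to count and
sample the records that fail to decode instead of letting the decoder abort the
job. A sketch with an illustrative path (assumes `sc` from the shell):

import base64

def bad_lines(line):
    try:
        base64.b64decode(line)
        return []
    except Exception:
        # Keep only the length and a prefix, enough to locate the record
        return [(len(line), line[:80])]

suspects = sc.textFile("hdfs:///logs/2016-06-14/*.gz").flatMap(bad_lines)
print(suspects.count())
print(suspects.take(5))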

On Tue, Jun 14, 2016 at 5:26 PM, Sean Owen <so...@cloudera.com> wrote:

> It's really the MR InputSplit code that splits files into records.
> Nothing particularly interesting happens in that process, except for
> breaking on newlines.
>
> Do you have one huge line in the file? are you reading as a text file?
> can you give any more detail about exactly how you parse it? it could
> be something else in your code.
>
> On Tue, Jun 14, 2016 at 10:24 AM, Kristoffer Sjögren <sto...@gmail.com>
> wrote:
> > Hi
> >
> > We have log files that are written in base64 encoded text files
> > (gzipped) where each line is ended with a new line character.
> >
> > For some reason a particular line [1] is split by Spark [2] making it
> > unparsable by the base64 decoder. It does this consequently no matter
> > if I gives it the particular file that contain the line or a bunch of
> > files.
> >
> > I know the line is not corrupt because I can manually download the
> > file from HDFS, gunzip it and read/decode all the lines without
> > problems.
> >
> > Was thinking that maybe there is a limit to number of characters per
> > line but that doesn't sound right? Maybe the combination of characters
> > makes Spark think it's new line?
> >
> > I'm clueless.
> >
> > Cheers,
> > -Kristoffer
> >
> > [1] Original line:
> >
> >
> CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRyd3dxam4scGE0c21ra2Z4LHBkM2tpa3BmMixwZDNqcjE5dGMscGQ0ZGQ0M2F3LHAwZ3MwbmlqMSxwYTRvZTNrbXoscGE0cWJ3eDZxLHBkM2s2NW00dyxwYTRyazc3Z3IscGQzMHAzdW8wLHBkNGM1ajV5dixwbHB0c211NmcscGM3bXNibmM5LHBhNHFpaTdsZCxwbHB0dnpqdnUscGE0bmlsdmFnLHBhNHB6cjN2cyxwZDNsZDNnYmkscGl1a2M2NmVlLHB5dHoyOThzNErIAgoTNzI0NTY2NzU0MzQxNTUyOTQ4ORAAGAAioQJxZ2tlamU0amYscGQzMGZ0NG40LHBscHRvcmo2dyxwbHBzMG9qZm8scGRiNWtkYzh4LHBscHM3b3djdSxwYTRzM3VtenkscGE0cmVrbmw3LHBhNHJ3d3FqbixwYTRzbWtrZngscGQza2lrcGYyLHBkM2pyMTl0YyxwZDRkZDQzYXcscDBnczBuaWoxLHBhNG9lM2tteixwYTRxYnd4NnEscGQzazY1bTR3LHBhNHJrNzdncixwZDMwcDN1bzAscGQ0YzVqNXl2LHBscHRzbXU2ZyxwYzdtc2JuYzkscGE0cWlpN2xkLHBscHR2emp2dSxwYTRuaWx2YWcscGE0cHpyM3ZzLHBkM2xkM2diaSxwaXVrYzY2ZWUscHl0ejI5OHM0KgkzOTUxLDM5NjAS3gIIxNjxhJTVsJcVEqUBTW96aWxsYS81LjAgKExpbnV4OyBBbmRyb2lkIDUuMS4xOyBTQU1TVU5HIFNNLUczODhGIEJ1aWxkL0xNWTQ4QikgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgU2Ftc3VuZ0Jyb3dzZXIvMy4zIENocm9tZS8zOC4wLjIxMjUuMTAyIE1vYmlsZSBTYWZhcmkvNTM3LjM2IjUKDDYyLjIwLjE5Ni44MBWgd3NBHRgibUIiAlNFKgfDlnJlYnJvMg5UZWxpYVNvbmVyYSBBQigAMdejcD0K1+s/OABCCAiAAhWamRlAQgcIURUAAOBAQggIlAEVzczMP0IHCFQVmpkJQUIICJYBFTMzE0BCBwhYFZqZ+UBCCAj6ARWamdk/QggImwEVzcysQEoHCAYVO6ysPkoHCAQVRYO4PkoHCAEVIg0APw===1465887564
> >
> >
> > [2] Line as spark hands it over:
> >
> >
> CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRyd3dxam4scGE0c21ra2Z4LHBkM2tpa3BmMixwZDNqcjE5dGMscGQ0ZGQ0M2F3LHAwZ3MwbmlqMSxwYTRvZTNrbXoscGE0cWJ3eDZxLHBkM2s2NW00dyxwYTRyazc3Z3IscGQzMHAzdW8wLHBkNGM1ajV5dixwbHB0c211NmcscGM3bXNibmM5LHBhNHFpaTdsZCxwbHB0dnpqdnUscGE0bmlsdmFnLHBhNHB6cjN2cyxwZDNsZDNnYmkscGl1a2M2NmVlLHB5dHoyOThzNErIAgoTNzI0NTY2NzU0MzQxNTUyOTQ4ORAAGAAioQJxZ2tlamU0amYscGQzMGZ0NG40LHBscHRvcmo2dyxwbHBzMG9qZm8scGRiNWtkYzh4LHBscHM3b3djdSxwYTRzM3VtenkscGE0cmVrbmw3LHBhNHJ3d3FqbixwYTRzbWtrZngscGQza2lrcGYyLHBkM2pyMTl0YyxwZDRkZDQzYXcscDBnczBuaWoxLHBhNG9lM2tteixwYTRxYnd4NnEscGQzazY1bTR3LHBhNHJrNzdncixwZDMwcDN1bzAscGQ0YzVqNXl2LHBscHRzbXU2ZyxwYzdtc2JuYzkscGE0cWlpN2xkLHBscHR2emp2dSxwYTRuaWx2YWcscGE0
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang


Re: java.io.FileNotFoundException

2016-06-03 Thread Jeff Zhang
One quick solution is to upgrade to Spark 1.6.1.

On Fri, Jun 3, 2016 at 8:35 PM, kishore kumar <akishore...@gmail.com> wrote:

> Could anyone help me on this issue ?
>
> On Tue, May 31, 2016 at 8:00 PM, kishore kumar <akishore...@gmail.com>
> wrote:
>
>> Hi,
>>
>> We installed spark1.2.1 in single node, running a job in yarn-client mode
>> on yarn which loads data into hbase and elasticsearch,
>>
>> the error which we are encountering is
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
>> due to stage failure: Task 38 in stage 26800.0 failed 4 times, most recent
>> failure: Lost task 38.3 in stage 26800.0 (TID 4990082, hdprd-c01-r04-03):
>> java.io.FileNotFoundException:
>> /opt/mapr/tmp/hadoop-tmp/hadoop-mapr/nm-local-dir/usercache/sparkuser/appcache/application_1463194314221_211370/spark-3cc37dc7-fa3c-4b98-aa60-0acdfc79c725/28/shuffle_8553_38_0.index
>> (No such file or directory)
>>
>> any idea about this error ?
>> --
>> Thanks,
>> Kishore.
>>
>
>
>
> --
> Thanks,
> Kishore.
>



-- 
Best Regards

Jeff Zhang


Bug of PolynomialExpansion ?

2016-05-29 Thread Jeff Zhang
I use PolynomialExpansion to expand a vector to a degree-2 vector. I am confused
by the result below. To my understanding, the degree-2 vector should contain
four 1's, and I am not sure where the five 1's come from. I thought it was
supposed to be (x1,x2,x3) * (x1,x2,x3) = (x1*x1, x1*x2, x1*x3, x2*x1, x2*x2,
x2*x3, x3*x1, x3*x2, x3*x3).

(3,[0,2],[1.0,1.0])  -->  (9,[0,1,5,6,8],[1.0,1.0,1.0,1.0,1.0])
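
A minimal way to reproduce this from pyspark (1.6-style imports, `sc` from the
shell) is below. If the degree-2 expansion also keeps the degree-1 terms, i.e.
(x1, x1^2, x2, x2*x1, x2^2, x3, x3*x1, x3*x2, x3^2), that would explain the five
1's for the input (1, 0, 1):

from pyspark.sql import SQLContext
from pyspark.ml.feature import PolynomialExpansion
from pyspark.mllib.linalg import Vectors

sqlContext = SQLContext(sc)  # sc from the pyspark shell

df = sqlContext.createDataFrame(
    [(Vectors.sparse(3, [0, 2], [1.0, 1.0]),)], ["features"])

px = PolynomialExpansion(degree=2, inputCol="features", outputCol="expanded")
px.transform(df).select("expanded").show(truncate=False)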


-- 
Best Regards

Jeff Zhang


Re: run multiple spark jobs yarn-client mode

2016-05-25 Thread Jeff Zhang
Could you check the YARN app logs ("yarn logs -applicationId <application_id>")?


On Wed, May 25, 2016 at 3:23 PM, <spark@yahoo.com.invalid> wrote:

> Hi,
>
> I am running spark streaming job on yarn-client mode. If run muliple jobs,
> some of the jobs failing and giving below error message. Is there any
> configuration missing?
>
> ERROR apache.spark.util.Utils - Uncaught exception in thread main
> java.lang.NullPointerException
> at
> org.apache.spark.network.netty.NettyBlockTransferService.close(NettyBlockTransferService.scala:152)
> at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1228)
> at org.apache.spark.SparkEnv.stop(SparkEnv.scala:100)
> at
> org.apache.spark.SparkContext$$anonfun$stop$12.apply$mcV$sp(SparkContext.scala:1749)
> at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1748)
> at org.apache.spark.SparkContext.(SparkContext.scala:593)
> at
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:878)
> at
> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
> at
> org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:134)
> at
> com.infinite.spark.SparkTweetStreamingHDFSLoad.init(SparkTweetStreamingHDFSLoad.java:212)
> at
> com.infinite.spark.SparkTweetStreamingHDFSLoad.main(SparkTweetStreamingHDFSLoad.java:162)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> INFO  org.apache.spark.SparkContext - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Yarn
> application has already ended! It might have been killed or unable to
> launch application master.
> at
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:123)
> at
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
> at
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
> at org.apache.spark.SparkContext.(SparkContext.scala:523)
> at
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:878)
> at
> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
> at
> org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:134)
> at
> com.infinite.spark.SparkTweetStreamingHDFSLoad.init(SparkTweetStreamingHDFSLoad.java:212)
> at
> com.infinite.spark.SparkTweetStreamingHDFSLoad.main(SparkTweetStreamingHDFSLoad.java:162)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> INFO  apache.spark.storage.DiskBlockManager - Shutdown hook called
> INFO  apache.spark.util.ShutdownHookManager - Shutdown hook called
> INFO  apache.spark.util.ShutdownHookManager - Deleting directory
> /tmp/spark-945fa8f4-477c-4a65-a572-b247e9249061/userFiles-857fece4-83c4-441a-8d3e-2a6ae8e3193a
> INFO  apache.spark.util.ShutdownHookManager - Deleting directory
> /tmp/spark-945fa8f4-477c-4a65-a572-b247e9249061
>
>
>
> Sent from Yahoo Mail. Get the app <https://yho.com/148vdq>
>



-- 
Best Regards

Jeff Zhang


Any way to pass custom hadoop conf to through spark thrift server ?

2016-05-19 Thread Jeff Zhang
I want to pass a custom Hadoop conf to the Spark thrift server so that both the
driver side and the executor side can see it. I also want this custom Hadoop
conf to be picked up only by the job of the user who set it. Is this possible
with the Spark thrift server today? Thanks



-- 
Best Regards

Jeff Zhang


Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Jeff Zhang
I think you can write it in GitBook and share it on the user mailing list so
that everyone can comment on it.

On Wed, May 18, 2016 at 10:12 AM, Vinayak Agrawal <
vinayakagrawa...@gmail.com> wrote:

> Please include me too.
>
> Vinayak Agrawal
> Big Data Analytics
> IBM
>
> "To Strive, To Seek, To Find and Not to Yield!"
> ~Lord Alfred Tennyson
>
> On May 17, 2016, at 2:15 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Hi all,
>
> Many thanks for your tremendous interest in the forthcoming notes. I have
> had nearly thirty requests and many supporting kind words from the
> colleagues in this forum.
>
> I will strive to get the first draft ready as soon as possible. Apologies
> for not being more specific. However, hopefully not too long for your
> perusal.
>
>
> Regards,
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 12 May 2016 at 11:08, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Hi Al,,
>>
>>
>> Following the threads in spark forum, I decided to write up on
>> configuration of Spark including allocation of resources and configuration
>> of driver, executors, threads, execution of Spark apps and general
>> troubleshooting taking into account the allocation of resources for Spark
>> applications and OS tools at the disposal.
>>
>> Since the most widespread configuration as I notice is with "Spark
>> Standalone Mode", I have decided to write these notes starting with
>> Standalone and later on moving to Yarn
>>
>>
>>-
>>
>>*Standalone *– a simple cluster manager included with Spark that
>>makes it easy to set up a cluster.
>>-
>>
>>*YARN* – the resource manager in Hadoop 2.
>>
>>
>> I would appreciate if anyone interested in reading and commenting to get
>> in touch with me directly on mich.talebza...@gmail.com so I can send the
>> write-up for their review and comments.
>>
>>
>> Just to be clear this is not meant to be any commercial proposition or
>> anything like that. As I seem to get involved with members troubleshooting
>> issues and threads on this topic, I thought it is worthwhile writing a note
>> about it to summarise the findings for the benefit of the community.
>>
>>
>> Regards.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>
>


-- 
Best Regards

Jeff Zhang


Re: Why does spark 1.6.0 can't use jar files stored on HDFS

2016-05-17 Thread Jeff Zhang
Did you put your application jar itself on HDFS? The application jar must be on
your local machine. The "Skip remote jar" warning means spark-submit does not
add remote (hdfs://) jars to the client-side classpath, which is why the driver
then fails with a class-not-found error in client mode.

On Tue, May 17, 2016 at 8:33 PM, Serega Sheypak <serega.shey...@gmail.com>
wrote:

> hi, I'm trying to:
> 1. upload my app jar files to HDFS
> 2. run spark-submit with:
> 2.1. --master yarn --deploy-mode cluster
> or
> 2.2. --master yarn --deploy-mode client
>
> specifying --jars hdfs:///my/home/commons.jar,hdfs:///my/home/super.jar
>
> When spark job is submitted, SparkSubmit client outputs:
> Warning: Skip remote jar hdfs:///user/baba/lib/akka-slf4j_2.11-2.3.11.jar
> ...
>
> and then spark application main class fails with class not found exception.
> Is there any workaround?
>



-- 
Best Regards

Jeff Zhang


Re: Spark crashes with Filesystem recovery

2016-05-17 Thread Jeff Zhang
I don't think this is related to filesystem recovery.
spark.deploy.recoveryMode / spark.deploy.recoveryDirectory are standalone-master
settings, which only take effect in standalone mode, but you are running in
local mode. (Also note that arbitrary properties are passed as "--conf
key=value"; "--spark.deploy.recoveryMode=FILESYSTEM" is not a valid spark-submit flag.)

Can you just start pyspark using "bin/pyspark --master local[4]"?

On Wed, May 11, 2016 at 3:52 AM, Imran Akbar <skunkw...@gmail.com> wrote:

> I have some Python code that consistently ends up in this state:
>
> ERROR:py4j.java_gateway:An error occurred while trying to connect to the
> Java server
> Traceback (most recent call last):
>   File
> "/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
> 690, in start
> self.socket.connect((self.address, self.port))
>   File "/usr/lib/python2.7/socket.py", line 224, in meth
> return getattr(self._sock,name)(*args)
> error: [Errno 111] Connection refused
> ERROR:py4j.java_gateway:An error occurred while trying to connect to the
> Java server
> Traceback (most recent call last):
>   File
> "/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
> 690, in start
> self.socket.connect((self.address, self.port))
>   File "/usr/lib/python2.7/socket.py", line 224, in meth
> return getattr(self._sock,name)(*args)
> error: [Errno 111] Connection refused
> Traceback (most recent call last):
>   File "", line 2, in 
>   File "/home/ubuntu/spark/python/pyspark/sql/dataframe.py", line 280, in
> collect
> port = self._jdf.collectToPython()
>   File "/home/ubuntu/spark/python/pyspark/traceback_utils.py", line 78, in
> __exit__
> self._context._jsc.setCallSite(None)
>   File
> "/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
> 811, in __call__
>   File
> "/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
> 624, in send_command
>   File
> "/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
> 579, in _get_connection
>   File
> "/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
> 585, in _create_connection
>   File
> "/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
> 697, in start
> py4j.protocol.Py4JNetworkError: An error occurred while trying to connect
> to the Java server
>
> Even though I start pyspark with these options:
> ./pyspark --master local[4] --executor-memory 14g --driver-memory 14g
> --packages com.databricks:spark-csv_2.11:1.4.0
> --spark.deploy.recoveryMode=FILESYSTEM
>
> and this in my /conf/spark-env.sh file:
> - SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM
> -Dspark.deploy.recoveryDirectory=/user/recovery"
>
> How can I get HA to work in Spark?
>
> thanks,
> imran
>
>


-- 
Best Regards

Jeff Zhang


Re: pandas dataframe broadcasted. giving errors in datanode function called kernel

2016-05-17 Thread Jeff Zhang
The following sample code works for me. Could you share your code? (The "No
module named indexes.base" ImportError usually points to a pandas version
mismatch between the driver and the worker Python environments: the pickled
DataFrame references pandas modules that only exist in newer pandas.)

import pandas as pd

df = pd.DataFrame([1, 2, 3])
df_b = sc.broadcast(df)

def f(a):
    # runs on the executors; df_b.value is the broadcast pandas DataFrame
    print(df_b.value)

sc.parallelize(range(1, 10)).foreach(f)


On Sat, May 14, 2016 at 12:59 AM, abi <analyst.tech.j...@gmail.com> wrote:

> pandas dataframe is broadcasted successfully. giving errors in datanode
> function called kernel
>
> Code:
>
> dataframe_broadcast  = sc.broadcast(dataframe)
>
> def kernel():
> df_v = dataframe_broadcast.value
>
>
> Error:
>
> I get this error when I try accessing the value member of the broadcast
> variable. Apprently it does not have a value, hence it tries to load from
> the file again.
>
>   File
> "C:\spark-1.6.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\broadcast.py",
> line 97, in value
> self._value = self.load(self._path)
>   File
> "C:\spark-1.6.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\broadcast.py",
> line 88, in load
> return pickle.load(f)
> ImportError: No module named indexes.base
>
> at
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
> at
>
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207)
> at
> org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
> at
> org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/pandas-dataframe-broadcasted-giving-errors-in-datanode-function-called-kernel-tp26953.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -----
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang


Re: Spark 1.6.1 throws error: Did not find registered driver with class oracle.jdbc.OracleDriver

2016-04-27 Thread Jeff Zhang
Could you check the executor log to find the full stack trace?
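
If the executor trace shows the same "Did not find registered driver" message, one
workaround that is often suggested (an assumption on my side, not something confirmed in
this thread) is to name the driver class explicitly in the JDBC options and ship the
Oracle JDBC jar to the executors with --jars. A sketch, re-using the variable names from
the quoted code below:

// spark-submit --jars /path/to/ojdbc6.jar ...   (jar path is a placeholder)
val s = HiveContext.read.format("jdbc").options(Map(
  "url"      -> _ORACLEserver,
  "driver"   -> "oracle.jdbc.OracleDriver",   // explicit driver class for the executors
  "dbtable"  -> "(SELECT to_char(ID) AS ID, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
  "user"     -> _username,
  "password" -> _password)).load()
s.registerTempTable("tmp")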

On Tue, Apr 26, 2016 at 12:30 AM, Mich Talebzadeh <mich.talebza...@gmail.com
> wrote:

> Hi,
>
> This JDBC connection was working fine in Spark 1.5,2
>
> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> val sqlContext = new HiveContext(sc)
> println ("\nStarted at"); sqlContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
> ").collect.foreach(println)
> //
> var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb"
> var _username : String = "scratchpad"
> var _password : String = "xxx"
> //
> val s = HiveContext.load("jdbc",
> Map("url" -> _ORACLEserver,
> "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED,
> to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED,
> RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
> "user" -> _username,
> "password" -> _password))
>
> s.toDF.registerTempTable("tmp")
>
>
> // Need to create and populate target ORC table sales in database test in
> Hive
> //
> HiveContext.sql("use test")
> //
> // Drop and create table
> //
> HiveContext.sql("DROP TABLE IF EXISTS test.dummy2")
> var sqltext : String = ""
> sqltext = """
> CREATE TABLE test.dummy2
>  (
>  ID INT
>, CLUSTERED INT
>, SCATTERED INT
>, RANDOMISED INT
>, RANDOM_STRING VARCHAR(50)
>, SMALL_VC VARCHAR(10)
>, PADDING  VARCHAR(10)
> )
> CLUSTERED BY (ID) INTO 256 BUCKETS
> STORED AS ORC
> TBLPROPERTIES ( "orc.compress"="SNAPPY",
> "orc.create.index"="true",
> "orc.bloom.filter.columns"="ID",
> "orc.bloom.filter.fpp"="0.05",
> "orc.stripe.size"="268435456",
> "orc.row.index.stride"="1" )
> """
> HiveContext.sql(sqltext)
> //
> sqltext = """
> INSERT INTO TABLE test.dummy2
> SELECT
> *
> FROM tmp
> """
> HiveContext.sql(sqltext)
>
> In Spark 1.6.1, it is throwing error as below
>
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 1.0 (TID 4, rhes564): java.lang.IllegalStateException: Did not find
> registered driver with class oracle.jdbc.OracleDriver
>
> Is this a new bug introduced in Spark 1.6.1?
>
>
> Thanks
>



-- 
Best Regards

Jeff Zhang


Re: executor delay in Spark

2016-04-24 Thread Jeff Zhang
rote:
>>>>
>>>>> When submitting a job with spark-submit, I've observed delays (up to
>>>>> 1--2 seconds) for the executors to respond to the driver in order to
>>>>> receive tasks in the first stage. The delay does not persist once the
>>>>> executors have been synchronized.
>>>>>
>>>>> When the tasks are very short, as may be your case (relatively small
>>>>> data and a simple map task like you have described), the 8 tasks in
>>>>> your stage may be allocated to only 1 executor in 2 waves of 4, since
>>>>> the second executor won't have responded to the master before the
>>>>> first 4 tasks on the first executor have completed.
>>>>>
>>>>> To see if this is the cause in your particular case, you could try the
>>>>> following to confirm:
>>>>> 1. Examine the starting times of the tasks alongside their
>>>>> executor
>>>>> 2. Make a "dummy" stage execute before your real stages to
>>>>> synchronize the executors by creating and materializing any random RDD
>>>>> 3. Make the tasks longer, i.e. with some silly computational
>>>>> work.
>>>>>
>>>>> Mike
>>>>>
>>>>>
>>>>> On 4/17/16, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote:
>>>>> > Yes its the same data.
>>>>> >
>>>>> > 1) The number of partitions are the same (8, which is an argument to
>>>>> the
>>>>> > HashPartitioner). In the first case, these partitions are spread
>>>>> across
>>>>> > both the worker nodes. In the second case, all the partitions are on
>>>>> the
>>>>> > same node.
>>>>> > 2) What resources would be of interest here? Scala shell takes the
>>>>> default
>>>>> > parameters since we use "bin/spark-shell --master " to
>>>>> run the
>>>>> > scala-shell. For the scala program, we do set some configuration
>>>>> options
>>>>> > such as driver memory (12GB), parallelism is set to 8 and we use Kryo
>>>>> > serializer.
>>>>> >
>>>>> > We are running this on Azure D3-v2 machines which have 4 cores and
>>>>> 14GB
>>>>> > RAM.1 executor runs on each worker node. Following configuration
>>>>> options
>>>>> > are set for the scala program -- perhaps we should move it to the
>>>>> spark
>>>>> > config file.
>>>>> >
>>>>> > Driver memory and executor memory are set to 12GB
>>>>> > parallelism is set to 8
>>>>> > Kryo serializer is used
>>>>> > Number of retainedJobs and retainedStages has been increased to
>>>>> check them
>>>>> > in the UI.
>>>>> >
>>>>> > What information regarding Spark Context would be of interest here?
>>>>> >
>>>>> > Regards,
>>>>> > Raghava.
>>>>> >
>>>>> > On Sun, Apr 17, 2016 at 10:54 PM, Anuj Kumar <anujs...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> >> If the data file is same then it should have similar distribution of
>>>>> >> keys.
>>>>> >> Few queries-
>>>>> >>
>>>>> >> 1. Did you compare the number of partitions in both the cases?
>>>>> >> 2. Did you compare the resource allocation for Spark Shell vs Scala
>>>>> >> Program being submitted?
>>>>> >>
>>>>> >> Also, can you please share the details of Spark Context,
>>>>> Environment and
>>>>> >> Executors when you run via Scala program?
>>>>> >>
>>>>> >> On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju <
>>>>> >> m.vijayaragh...@gmail.com> wrote:
>>>>> >>
>>>>> >>> Hello All,
>>>>> >>>
>>>>> >>> We are using HashPartitioner in the following way on a 3 node
>>>>> cluster (1
>>>>> >>> master and 2 worker nodes).
>>>>> >>>
>>>>> >>> val u =
>>>>> >>> sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt").map[(Int,
>>>>> >>> Int)](line => { line.split("\\|") match { case Array(x, y) =>
>>>>> (y.toInt,
>>>>> >>> x.toInt) } }).partitionBy(new
>>>>> HashPartitioner(8)).setName("u").persist()
>>>>> >>>
>>>>> >>> u.count()
>>>>> >>>
>>>>> >>> If we run this from the spark shell, the data (52 MB) is split
>>>>> across
>>>>> >>> the
>>>>> >>> two worker nodes. But if we put this in a scala program and run
>>>>> it, then
>>>>> >>> all the data goes to only one node. We have run it multiple times,
>>>>> but
>>>>> >>> this
>>>>> >>> behavior does not change. This seems strange.
>>>>> >>>
>>>>> >>> Is there some problem with the way we use HashPartitioner?
>>>>> >>>
>>>>> >>> Thanks in advance.
>>>>> >>>
>>>>> >>> Regards,
>>>>> >>> Raghava.
>>>>> >>>
>>>>> >>
>>>>> >>
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Regards,
>>>>> > Raghava
>>>>> > http://raghavam.github.io
>>>>> >
>>>>>
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Mike
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Raghava
>>>> http://raghavam.github.io
>>>>
>>>
>>
>>
>> --
>> Regards,
>> Raghava
>> http://raghavam.github.io
>>
>
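
For reference, a minimal sketch of the "dummy stage" warm-up suggested above (the sizes
are arbitrary), plus two real scheduler settings that are not discussed in this thread
but serve the same purpose:

// Materialize a trivial RDD across many partitions so every executor has
// registered with the driver before the real, short tasks are scheduled.
sc.parallelize(1 to 10000, 64).count()

// Alternatively, make the scheduler wait for executors before the first stage:
//   --conf spark.scheduler.minRegisteredResourcesRatio=1.0
//   --conf spark.scheduler.maxRegisteredResourcesWaitingTime=30s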


-- 
Best Regards

Jeff Zhang


Re: Re: Re: Why Spark having OutOfMemory Exception?

2016-04-20 Thread Jeff Zhang
Do you mean the input data size is 10M, or the task result size?

>>> But my way is to setup a forever loop to handle continued income data. Not
sure if it is the right way to use spark
Not sure what this means. Do you use spark-streaming, or do you run batch jobs in
the forever loop?
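
As a side note (the values below are arbitrary): spark.driver.maxResultSize can be raised
from code, but the driver heap itself has to be set before the driver JVM starts, i.e. at
submit time rather than on the SparkConf inside the job.

import org.apache.spark.{SparkConf, SparkContext}

// Raising the result-size cap from code works:
val conf = new SparkConf()
  .setAppName("loop-job")
  .set("spark.driver.maxResultSize", "2g")
val sc = new SparkContext(conf)

// The driver heap does not; set it at launch instead, e.g.:
//   spark-submit --driver-memory 4g --conf spark.driver.maxResultSize=2g ...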



On Wed, Apr 20, 2016 at 3:55 PM, 李明伟 <kramer2...@126.com> wrote:

> Hi Jeff
>
> The total size of my data is less than 10M. I already set the driver
> memory to 4GB.
>
>
>
>
>
>
>
> 在 2016-04-20 13:42:25,"Jeff Zhang" <zjf...@gmail.com> 写道:
>
> Seems it is OOM in driver side when fetching task result.
>
> You can try to increase spark.driver.memory and spark.driver.maxResultSize
>
> On Tue, Apr 19, 2016 at 4:06 PM, 李明伟 <kramer2...@126.com> wrote:
>
>> Hi Zhan Zhang
>>
>>
>> Please see the exception trace below. It is saying some GC overhead limit
>> error
>> I am not a java or scala developer so it is hard for me to understand
>> these infor.
>> Also reading coredump is too difficult to me..
>>
>> I am not sure if the way I am using spark is correct. I understand that
>> spark can do batch or stream calculation. But my way is to setup a forever
>> loop to handle continued income data.
>> Not sure if it is the right way to use spark
>>
>>
>> 16/04/19 15:54:55 ERROR Utils: Uncaught exception in thread
>> task-result-getter-2
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>> at
>> scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
>> at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
>> at
>> scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
>> at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>> at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
>> at
>> org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
>> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>> at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
>> at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>> at
>> org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
>> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>> at
>> org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
>> at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>> at
>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>> at
>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>> Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC
>> overhead limit exceeded
>> at
>> scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
>> at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
>> at
>> scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
>> at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcc

Re: Re: Why Spark having OutOfMemory Exception?

2016-04-19 Thread Jeff Zhang
> >Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> >-----
> >To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
> >commands, e-mail: user-h...@spark.apache.org
> >
> >Information transmitted by this e-mail is proprietary to Mphasis, its 
> >associated companies and/ or its customers and is intended
> >for use only by the individual or entity to which it is addressed, and may 
> >contain information that is privileged, confidential or
> >exempt from disclosure under applicable law. If you are not the intended 
> >recipient or it appears that this mail has been forwarded
> >to you without proper authority, you are notified that any use or 
> >dissemination of this information in any manner is strictly
> >prohibited. In such cases, please notify us immediately at 
> >mailmas...@mphasis.com and delete this mail from your records.
> >
>
>
>
>
>
>
>
>
>
>



-- 
Best Regards

Jeff Zhang


Re: Spark 1.6.0 - token renew failure

2016-04-13 Thread Jeff Zhang
   at
> org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:621)
> at
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:721)
> at
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:142)
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1065)
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1125)
> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at
> org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:163)
> at
> org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:161)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:161)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
> luca.rea tries to renew a token with renewer spark
> at
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:481)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:6793)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:635)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:1005)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.cal
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInforma
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
>
> at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEng
> at com.sun.proxy.$Proxy22.renewDelegationToken(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryI
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocat
> at com.sun.proxy.$Proxy23.renewDelegationToken(Unknown Source)
> at
> org.apache.hadoop.hdfs.DFSClient$Renewer.renew(DFSClient.java:1145)
> ... 22 more
>
>
>
>
>
> Spark-defaults.conf :
>
> spark.yarn.principal spark-pantagr...@contactlab.lan
> spark.yarn.keytab /etc/security/keytabs/spark.headless.keytab
>
>
>
> core-site.xml:
>
> 
>   hadoop.proxyuser.spark.groups
>   *
> 
>
> 
>   hadoop.proxyuser.spark.hosts
>   *
> 
>
> ...
>
> 
>   hadoop.security.auth_to_local
>   
> RULE:[1:$1@$0](spark-pantagr...@contactlab.lan)s/.*/spark/
> DEFAULT
>  
> 
>
>
> "spark" is present as local user in all servers.
>
>
> What does is missing here ?
>
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang


Re: sqlContext.cacheTable + yarn client mode

2016-03-30 Thread Jeff Zhang
The table data is cached in the block managers on the executors. Could you paste
the OOM stack trace from your driver log?
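
A small sketch to confirm where the cached blocks end up (the table name is a
placeholder); note that cacheTable is lazy, so an action is needed before anything shows
up under the Storage tab:

sqlContext.cacheTable("tableName")
sqlContext.sql("SELECT COUNT(*) FROM tableName").collect()   // materializes the cache
println(sqlContext.isCached("tableName"))                    // should print true
sc.getExecutorMemoryStatus.foreach { case (host, (maxMem, remaining)) =>
  println(s"$host: ${maxMem - remaining} of $maxMem bytes used for storage")
}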

On Thu, Mar 31, 2016 at 1:24 PM, Soam Acharya <s...@altiscale.com> wrote:

> Hi folks,
>
> I understand that invoking sqlContext.cacheTable("tableName") will load
> the table into a compressed in-memory columnar format. When Spark is
> launched via spark shell in YARN client mode, is the table loaded into the
> local Spark driver process in addition to the executors in the Hadoop
> cluster or is it just loaded into the executors? We're exploring an OOM
> issue on the local Spark driver for some SQL code and was wondering if the
> local cache load could be the culprit.
>
> Appreciate any thoughts. BTW, we're running Spark 1.6.0 on this particular
> cluster.
>
> Regards,
>
> Soam
>



-- 
Best Regards

Jeff Zhang


Re: pyspark unable to convert dataframe column to a vector: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2016-03-29 Thread Jeff Zhang
l.Py4JJavaError as e: 47 s = 
> e.java_exception.toString()
> /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)306  
>raise Py4JJavaError(307 "An error occurred 
> while calling {0}{1}{2}.\n".--> 308 format(target_id, 
> ".", name), value)309 else:310 raise 
> Py4JError(
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.lang.RuntimeException: Unable to 
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:204)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:249)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:327)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
>   at 
> org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:226)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:229)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>   
>
>


-- 
Best Regards

Jeff Zhang


Re: run spark job

2016-03-29 Thread Jeff Zhang
Yes, you can. But this is actually what spark-submit does for you, and
spark-submit does more than just that.
You can refer here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala

What is your purpose for using "java -cp"? For local development, the IDE should
be sufficient.
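
For example, a minimal self-contained main (class and app names are made up) that runs
directly from the IDE, or with plain java -cp if the Spark and Scala jars are on the
classpath:

import org.apache.spark.{SparkConf, SparkContext}

object IdeRun {
  def main(args: Array[String]): Unit = {
    // local[*] keeps everything in one JVM, which is what an IDE run does anyway
    val conf = new SparkConf().setAppName("ide-run").setMaster("local[*]")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())
    sc.stop()
  }
}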





On Tue, Mar 29, 2016 at 12:26 PM, Fei Hu <hufe...@gmail.com> wrote:

> Hi,
>
> I am wondering how to run the spark job by java command, such as: java -cp
> spark.jar mainclass. When running/debugging the spark program in IntelliJ
> IDEA, it uses java command to run spark main class, so I think it should be
> able to run the spark job by java command besides the spark-submit command.
>
> Thanks in advance,
> Fei
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang


Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
Zhan's reply on stackoverflow is correct.



Please refer to the comments in sequenceFile.

/** Get an RDD for a Hadoop SequenceFile with given key and value types.
 *
 * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
 * record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
 * operation will create many references to the same object.
 * If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first
 * copy them using a map function.
 */
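
In other words, something along these lines before caching (the path and Writable types
are assumptions):

import org.apache.hadoop.io.Text

val rdd = sc.sequenceFile("hdfs:///path/to/data.seq", classOf[Text], classOf[Text])
  .map { case (k, v) => (k.toString, v.toString) } // copy out of the reused Writable objects
  .cache()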



On Wed, Mar 23, 2016 at 11:58 AM, Jeff Zhang <zjf...@gmail.com> wrote:

> I think I got the root cause, you can use Text.toString() to solve this
> issue.  Because the Text is shared so the last record display multiple
> times.
>
> On Wed, Mar 23, 2016 at 11:37 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> Looks like a spark bug. I can reproduce it for sequence file, but it
>> works for text file.
>>
>> On Wed, Mar 23, 2016 at 10:56 AM, Thamme Gowda N. <tgow...@gmail.com>
>> wrote:
>>
>>> Hi spark experts,
>>>
>>> I am facing issues with cached RDDs. I noticed that few entries
>>> get duplicated for n times when the RDD is cached.
>>>
>>> I asked a question on Stackoverflow with my code snippet to reproduce it.
>>>
>>> I really appreciate  if you can visit
>>> http://stackoverflow.com/q/36168827/1506477
>>> and answer my question / give your comments.
>>>
>>> Or at the least confirm that it is a bug.
>>>
>>> Thanks in advance for your help!
>>>
>>> --
>>> Thamme
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 
Best Regards

Jeff Zhang


Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
I think I got the root cause: you can use Text.toString() to solve this
issue. Because the Text object is re-used (shared), the last record is displayed multiple
times.

On Wed, Mar 23, 2016 at 11:37 AM, Jeff Zhang <zjf...@gmail.com> wrote:

> Looks like a spark bug. I can reproduce it for sequence file, but it works
> for text file.
>
> On Wed, Mar 23, 2016 at 10:56 AM, Thamme Gowda N. <tgow...@gmail.com>
> wrote:
>
>> Hi spark experts,
>>
>> I am facing issues with cached RDDs. I noticed that few entries
>> get duplicated for n times when the RDD is cached.
>>
>> I asked a question on Stackoverflow with my code snippet to reproduce it.
>>
>> I really appreciate  if you can visit
>> http://stackoverflow.com/q/36168827/1506477
>> and answer my question / give your comments.
>>
>> Or at the least confirm that it is a bug.
>>
>> Thanks in advance for your help!
>>
>> --
>> Thamme
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 
Best Regards

Jeff Zhang


Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
Looks like a spark bug. I can reproduce it with a sequence file, but it works
with a text file.

On Wed, Mar 23, 2016 at 10:56 AM, Thamme Gowda N. <tgow...@gmail.com> wrote:

> Hi spark experts,
>
> I am facing issues with cached RDDs. I noticed that few entries
> get duplicated for n times when the RDD is cached.
>
> I asked a question on Stackoverflow with my code snippet to reproduce it.
>
> I really appreciate  if you can visit
> http://stackoverflow.com/q/36168827/1506477
> and answer my question / give your comments.
>
> Or at the least confirm that it is a bug.
>
> Thanks in advance for your help!
>
> --
> Thamme
>



-- 
Best Regards

Jeff Zhang


Re: DataFrame vs RDD

2016-03-22 Thread Jeff Zhang
Please check the official doc

http://spark.apache.org/docs/latest/
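
As a quick illustration (made-up data), the same aggregation in both APIs: RDDs give you
functional transformations over your own objects, while DataFrames are declarative and
optimized by Catalyst:

import sqlContext.implicits._

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("c", 1)))

// RDD style: functional transformations over your own objects
val rddCounts = pairs.reduceByKey(_ + _)

// DataFrame style: declarative, optimized by Catalyst
val dfCounts = pairs.toDF("word", "n").groupBy("word").sum("n")

rddCounts.collect().foreach(println)
dfCounts.show()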


On Wed, Mar 23, 2016 at 10:08 AM, asethia <sethia.a...@gmail.com> wrote:

> Hi,
>
> I am new to Spark, would like to know any guidelines when to use Data Frame
> vs. RDD.
>
> Thanks,
> As
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-vs-RDD-tp26570.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang


Re: exception while running job as pyspark

2016-03-16 Thread Jeff Zhang
Please try export PYSPARK_PYTHON=<path to the python interpreter in your virtualenv>

On Wed, Mar 16, 2016 at 3:00 PM, ram kumar <ramkumarro...@gmail.com> wrote:

> Hi,
>
> I get the following error when running a job as pyspark,
>
> {{{
>  An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 0.0 (TID 6, ): java.io.IOException: Cannot run program "python2.7":
> error=2, No such file or directory
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
> at
> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:160)
> at
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
> at
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
> at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:135)
> at
> org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: error=2, No such file or directory
> at java.lang.UNIXProcess.forkAndExec(Native Method)
> at java.lang.UNIXProcess.(UNIXProcess.java:248)
> at java.lang.ProcessImpl.start(ProcessImpl.java:134)
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
> ... 36 more
>
> }}}
>
> python2.7 couldn't found. But i m using vertual env python 2.7
> {{{
> [ram@test-work workspace]$ python
> Python 2.7.8 (default, Mar 15 2016, 04:37:00)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>>
> }}}
>
> Can anyone help me with this?
> Thanks
>



-- 
Best Regards

Jeff Zhang


Re: Job failed while submitting python to yarn programatically

2016-03-15 Thread Jeff Zhang
ark.deploy.yarn.ApplicationMaster$.main
> (ApplicationMaster.scala:651)
>  at org.apache.spark.deploy.yarn.ExecutorLauncher$.main
> (ApplicationMaster.scala:674)
>  at org.apache.spark.deploy.yarn.ExecutorLauncher.main
> (ApplicationMaster.scala)
> 16/03/15 17:56:26 INFO yarn.ApplicationMaster: Final app status: FAILED,
> exitCode: 10, (reason: Uncaught exception: org.apache.spark.SparkException:
> Failed to connect to driver!)
> 16/03/15 17:56:26 INFO util.ShutdownHookManager: Shutdown hook called
>
> Best regards,
>
> S.Y. Chung 鍾學毅
> F14MITD
> Taiwan Semiconductor Manufacturing Company, Ltd.
> Tel: 06-5056688 Ext: 734-6325
>
>  ---
>  TSMC PROPERTY
>  This email communication (and any attachments) is proprietary information
>  for the sole use of its
>  intended recipient. Any unauthorized review, use or distribution by anyone
>  other than the intended
>  recipient is strictly prohibited.  If you are not the intended recipient,
>  please notify the sender by
>  replying to this email, and then delete this email and any copies of it
>  immediately. Thank you.
>
>  ---
>
>
>


-- 
Best Regards

Jeff Zhang


Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
It's the same as the hive thrift server. I believe Kerberos is supported.

On Wed, Mar 16, 2016 at 10:48 AM, ayan guha <guha.a...@gmail.com> wrote:

> so, how about implementing security? Any pointer will be helpful
>
> On Wed, Mar 16, 2016 at 1:44 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> The spark thrift server allow you to run hive query in spark engine. It
>> can be used as jdbc server.
>>
>> On Wed, Mar 16, 2016 at 10:42 AM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Sorry to be dumb-head today, but what is the purpose of spark
>>> thriftserver then? In other words, should I view spark thriftserver as a
>>> better version of hive one (with Spark as execution engine instead of
>>> MR/Tez) OR should I see it as a JDBC server?
>>>
>>> On Wed, Mar 16, 2016 at 11:44 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>
>>>> spark thrift server is very similar with hive thrift server. You can
>>>> use hive jdbc driver to access spark thrift server. AFAIK, all the features
>>>> of hive thrift server are also available in spark thrift server.
>>>>
>>>> On Wed, Mar 16, 2016 at 8:39 AM, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> Hi All
>>>>>
>>>>> My understanding about thriftserver is to use it to expose pre-loaded
>>>>> RDD/dataframes to tools who can connect through JDBC.
>>>>>
>>>>> Is there something like Spark JDBC server too? Does it do the same
>>>>> thing? What is the difference between these two?
>>>>>
>>>>> How does Spark JDBC/Thrift supports security? Can we restrict certain
>>>>> users to access certain dataframes and not the others?
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards
>>>>
>>>> Jeff Zhang
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>



-- 
Best Regards

Jeff Zhang


Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
The spark thrift server allows you to run hive queries on the spark engine. It can
be used as a jdbc server.

On Wed, Mar 16, 2016 at 10:42 AM, ayan guha <guha.a...@gmail.com> wrote:

> Sorry to be dumb-head today, but what is the purpose of spark thriftserver
> then? In other words, should I view spark thriftserver as a better version
> of hive one (with Spark as execution engine instead of MR/Tez) OR should I
> see it as a JDBC server?
>
> On Wed, Mar 16, 2016 at 11:44 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> spark thrift server is very similar with hive thrift server. You can use
>> hive jdbc driver to access spark thrift server. AFAIK, all the features of
>> hive thrift server are also available in spark thrift server.
>>
>> On Wed, Mar 16, 2016 at 8:39 AM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi All
>>>
>>> My understanding about thriftserver is to use it to expose pre-loaded
>>> RDD/dataframes to tools who can connect through JDBC.
>>>
>>> Is there something like Spark JDBC server too? Does it do the same
>>> thing? What is the difference between these two?
>>>
>>> How does Spark JDBC/Thrift supports security? Can we restrict certain
>>> users to access certain dataframes and not the others?
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>



-- 
Best Regards

Jeff Zhang


Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
The spark thrift server is very similar to the hive thrift server. You can use
the hive jdbc driver to access the spark thrift server. AFAIK, all the features of
the hive thrift server are also available in the spark thrift server.
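
For example, a bare-bones JDBC client (host, port and credentials are placeholders; the
Hive JDBC driver jar has to be on the classpath):

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
val stmt = conn.createStatement()
val rs   = stmt.executeQuery("SHOW TABLES")
while (rs.next()) println(rs.getString(1))
conn.close()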

On Wed, Mar 16, 2016 at 8:39 AM, ayan guha <guha.a...@gmail.com> wrote:

> Hi All
>
> My understanding about thriftserver is to use it to expose pre-loaded
> RDD/dataframes to tools who can connect through JDBC.
>
> Is there something like Spark JDBC server too? Does it do the same thing?
> What is the difference between these two?
>
> How does Spark JDBC/Thrift supports security? Can we restrict certain
> users to access certain dataframes and not the others?
>
> --
> Best Regards,
> Ayan Guha
>



-- 
Best Regards

Jeff Zhang


Re: what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Jeff Zhang
Right, it is a little confusing here: dropTempTable actually means unregister.
It only deletes the metadata of this table from the catalog, but you can still
operate on the table through its dataframe.

On Wed, Mar 16, 2016 at 8:27 AM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Thanks Jeff
>
> I was looking for something like ‘unregister’
>
>
> In SQL you use drop to delete a table. I always though register was a
> strange function name.
>
> Register **-1 = unregister
> createTable **-1 == dropTable
>
> Andy
>
> From: Jeff Zhang <zjf...@gmail.com>
> Date: Tuesday, March 15, 2016 at 4:44 PM
> To: Andrew Davidson <a...@santacruzintegration.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: what is the pyspark inverse of registerTempTable()?
>
> >>> sqlContext.registerDataFrameAsTable(df, "table1")
> >>> sqlContext.dropTempTable("table1")
>
>
>
> On Wed, Mar 16, 2016 at 7:40 AM, Andy Davidson <
> a...@santacruzintegration.com> wrote:
>
>> Thanks
>>
>> Andy
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>


-- 
Best Regards

Jeff Zhang


Re: what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Jeff Zhang
>>> sqlContext.registerDataFrameAsTable(df, "table1")
>>> sqlContext.dropTempTable("table1")



On Wed, Mar 16, 2016 at 7:40 AM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Thanks
>
> Andy
>



-- 
Best Regards

Jeff Zhang


Re: Saving multiple outputs in the same job

2016-03-09 Thread Jeff Zhang
Spark will skip a stage if it has already been computed by another job. That means the
common parent RDD of the jobs only needs to be computed once, but it is
still multiple sequential jobs, not concurrent jobs.
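
A minimal sketch of that pattern (output paths are placeholders): persist the shared
parent so each save job reuses it, and note that the two saves still run one after the
other:

// Shared parent computed once, reused by both save jobs.
val shared = sc.parallelize(1 to 1000000).map(n => (n, n * n)).persist()

shared.filter(_._1 % 2 == 0).saveAsTextFile("hdfs:///tmp/even")  // job 1
shared.filter(_._1 % 2 != 0).saveAsTextFile("hdfs:///tmp/odd")   // job 2, runs after job 1

shared.unpersist()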

On Wed, Mar 9, 2016 at 3:29 PM, Jan Štěrba <i...@jansterba.com> wrote:

> Hi Andy,
>
> its nice to see that we are not the only ones with the same issues. So
> far we have not gone as far as you have. What we have done is that we
> cache whatever dataframes/rdds are shared foc computing different
> output. This has brought us quite the speedup, but we still see that
> saving some large output blocks all other computation even though the
> save uses only one executor and rest of the cluster is just waiting.
>
> I was thinking about trying something similar to what you are
> describing in 1) but I am sad to see it is riddled with bugs and to me
> it seems like going againts spark in a way.
>
> Hope someone can help in resolving this.
>
> Cheers.
>
> Jan
> --
> Jan Sterba
> https://twitter.com/honzasterba | http://flickr.com/honzasterba |
> http://500px.com/honzasterba
>
>
> On Wed, Mar 9, 2016 at 2:31 AM, Andy Sloane <a...@a1k0n.net> wrote:
> > We have a somewhat complex pipeline which has multiple output files on
> HDFS,
> > and we'd like the materialization of those outputs to happen
> concurrently.
> >
> > Internal to Spark, any "save" call creates a new "job", which runs
> > synchronously -- that is, the line of code after your save() executes
> once
> > the job completes, executing the entire dependency DAG to produce it.
> Same
> > with foreach, collect, count, etc.
> >
> > The files we want to save have overlapping dependencies. For us to create
> > multiple outputs concurrently, we have a few options that I can see:
> >  - Spawn a thread for each file we want to save, letting Spark run the
> jobs
> > somewhat independently. This has the downside of various concurrency bugs
> > (e.g. SPARK-4454, and more recently SPARK-13631) and also causes RDDs up
> the
> > dependency graph to get independently, uselessly recomputed.
> >  - Analyze our own dependency graph, materialize (by checkpointing or
> > saving) common dependencies, and then executing the two saves in threads.
> >  - Implement our own saves to HDFS as side-effects inside mapPartitions
> > (which is how save actually works internally anyway, modulo committing
> logic
> > to handle speculative execution), yielding an empty dummy RDD for each
> thing
> > we want to save, and then run foreach or count on the union of all the
> dummy
> > RDDs, which causes Spark to schedule the entire DAG we're interested in.
> >
> > Currently we are doing a little of #1 and a little of #3, depending on
> who
> > originally wrote the code. #2 is probably closer to what we're supposed
> to
> > be doing, but IMO Spark is already able to produce a good execution plan
> and
> > we shouldn't have to do that.
> >
> > AFAIK, there's no way to do what I *actually* want in Spark, which is to
> > have some control over which saves go into which jobs, and then execute
> the
> > jobs directly. I can envision a new version of the various save functions
> > which take an extra job argument, or something, or some way to defer and
> > unblock job creation in the spark context.
> >
> > Ideas?
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang


Re: OOM exception during Broadcast

2016-03-07 Thread Jeff Zhang
Any reason why you broadcast such a large variable? It doesn't make sense
to me.
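
If the lookup data really is around 1G, one alternative worth considering (my suggestion,
not something from this thread) is to keep it as an RDD and join against it instead of
broadcasting it. A sketch with tiny stand-in data:

val tuples = sc.parallelize(Seq((("a", "b"), ("x", "y"))))          // stands in for the ~10M-tuple lookup set
val nums   = sc.parallelize(1 to 5000, 100).map(n => (("a", "b"), n))

val joined = nums.join(tuples)   // shuffle join; no 1G object has to fit in every executor at once
println(joined.count())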

On Tue, Mar 8, 2016 at 7:29 AM, Arash <aras...@gmail.com> wrote:

> Hello all,
>
> I'm trying to broadcast a variable of size ~1G to a cluster of 20 nodes
> but haven't been able to make it work so far.
>
> It looks like the executors start to run out of memory during
> deserialization. This behavior only shows itself when the number of
> partitions is above a few 10s, the broadcast does work for 10 or 20
> partitions.
>
> I'm using the following setup to observe the problem:
>
> val tuples: Array[((String, String), (String, String))]  // ~ 10M
> tuples
> val tuplesBc = sc.broadcast(tuples)
> val numsRdd = sc.parallelize(1 to 5000, 100)
> numsRdd.map(n => tuplesBc.value.head).count()
>
> If I set the number of partitions for numsRDD to 20, the count goes
> through successfully, but at 100, I'll start to get errors such as:
>
> 16/03/07 19:35:32 WARN scheduler.TaskSetManager: Lost task 77.0 in stage
> 1.0 (TID 1677, xxx.ec2.internal): java.lang.OutOfMemoryError: Java heap
> space
> at
> java.io.ObjectInputStream$HandleTable.grow(ObjectInputStream.java:3472)
> at
> java.io.ObjectInputStream$HandleTable.assign(ObjectInputStream.java:3278)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1789)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at
> scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1897)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
>
>
> I'm using spark 1.5.2. Cluster nodes are amazon r3.2xlarge. The spark
> property maximizeResourceAllocation is set to true (executor.memory = 48G
> according to spark ui environment). We're also using kryo serialization and
> Yarn is the resource manager.
>
> Any ideas as what might be going wrong and how to debug this?
>
> Thanks,
> Arash
>
>


-- 
Best Regards

Jeff Zhang


Re: Renaming sc variable in sparkcontext throws task not serializable

2016-03-02 Thread Jeff Zhang
I can reproduce it in spark-shell, but it works in a batch job. Looks like a
spark repl issue.
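
A workaround that is often suggested for this kind of REPL capture problem (not verified
in this thread) is to mark the alias @transient so the REPL line object stays
serializable:

val temp = 10
@transient val newSC = sc
val newRDD = newSC.parallelize(0 to 100).map(p => p + temp)
newRDD.count()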

On Thu, Mar 3, 2016 at 10:43 AM, Rahul Palamuttam <rahulpala...@gmail.com>
wrote:

> Hi All,
>
> We recently came across this issue when using the spark-shell and zeppelin.
> If we assign the sparkcontext variable (sc) to a new variable and reference
> another variable in an RDD lambda expression we get a task not
> serializable exception.
>
> The following three lines of code illustrate this :
>
> val temp = 10
> val newSC = sc
> val newRDD = newSC.parallelize(0 to 100).map(p => p + temp)
>
> I am not sure if this is a known issue, or we should file a JIRA for it.
> We originally came across this bug in the SciSpark project.
>
> Best,
>
> Rahul P
>



-- 
Best Regards

Jeff Zhang


Re: Spark on Yarn with Dynamic Resource Allocation. Container always marked as failed

2016-03-02 Thread Jeff Zhang
The executor may have failed to start. You need to check the executor logs; if
there is no executor log, then check the node manager log.

On Wed, Mar 2, 2016 at 4:26 PM, Xiaoye Sun <sunxiaoy...@gmail.com> wrote:

> Hi all,
>
> I am very new to spark and yarn.
>
> I am running a BroadcastTest example application using spark 1.6.0 and
> Hadoop/Yarn 2.7.1. in a 5 nodes cluster.
>
> I configured my configuration files according to
> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>
> 1. copy
> ./spark-1.6.0/network/yarn/target/scala-2.10/spark-1.6.0-yarn-shuffle.jar
> to /hadoop-2.7.1/share/hadoop/yarn/lib/
> 2. yarn-site.xml is like this
> http://www.owlnet.rice.edu/~xs6/yarn-site.xml
> 3. spark-defaults.conf is like this
> http://www.owlnet.rice.edu/~xs6/spark-defaults.conf
> 4. spark-env.sh is like this http://www.owlnet.rice.edu/~xs6/spark-env.sh
> 5. the command I use to submit spark application is: ./bin/spark-submit
> --class org.apache.spark.examples.BroadcastTest --master yarn --deploy-mode
> cluster ./examples/target/spark-examples_2.10-1.6.0.jar 1 1000 Http
>
> However, the job is stuck at RUNNING status, and by looking at the log, I
> found that the executor is failed/cancelled frequently...
> Here is the log output http://www.owlnet.rice.edu/~xs6/stderr
> It shows something like
>
> 16/03/02 02:07:35 WARN yarn.YarnAllocator: Container marked as failed: 
> container_1456905762620_0002_01_02 on host: bold-x.rice.edu. Exit status: 
> 1. Diagnostics: Exception from container-launch.
>
>
> Is there anybody know what is the problem here?
> Best,
> Xiaoye
>



-- 
Best Regards

Jeff Zhang


Re: Spark executor killed without apparent reason

2016-03-01 Thread Jeff Zhang
channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>>> at
>>> io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:739)
>>> at
>>> io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:659)
>>> at
>>> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
>>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>>> at
>>> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>>> at java.lang.Thread.run(Thread.java:744)
>>> 16/02/24 11:11:47 INFO shuffle.RetryingBlockFetcher: Retrying fetch
>>> (1/3) for 6 outstanding blocks after 5000 ms
>>> 16/02/24 11:11:52 INFO client.TransportClientFactory: Found inactive
>>> connection to maprnode5, creating a new one.
>>> 16/02/24 11:12:16 WARN server.TransportChannelHandler: Exception in
>>> connection from maprnode5
>>> java.io.IOException: Connection reset by peer
>>> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>> at
>>> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
>>> at
>>> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>>> at
>>> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
>>> at
>>> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>>> at
>>> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>>> at
>>> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>>> at
>>> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>>> at
>>> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>>> at java.lang.Thread.run(Thread.java:744)
>>> 16/02/24 11:12:16 ERROR client.TransportResponseHandler: Still have 1
>>> requests outstanding when connection from maprnode5 is closed
>>> 16/02/24 11:12:16 ERROR shuffle.OneForOneBlockFetcher: Failed while
>>> starting block fetches
>>>
>>> Also, I can see lot of ExecutorLost error in Driver logs but I can'\t
>>> seem to connect it to Above executor logs. I think executor logs should at
>>> least have mentioning of executor ID/task ID (EID-TID) and not just task ID
>>> (TID).
>>>
>>> this is snippet of driver logs from ui:
>>>
>>> 189 15283 0 FAILED PROCESS_LOCAL 15 / maprnode5 2016/02/24 11:08:55 / 
>>> ExecutorLostFailure
>>> (executor 15 lost)
>>>
>>> here we can see executor id is 5 but executor logs itself doesn't use
>>> this id as reference in log stream so it's hard to cross check logs.
>>>
>>>
>>> Anyhow my main issue is to determine cause of executor killing.
>>>
>>>
>>> Thanks
>>>
>>> Nirav
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>
>



-- 
Best Regards

Jeff Zhang


Re: Is spark.driver.maxResultSize used correctly ?

2016-03-01 Thread Jeff Zhang
I checked the code again. It looks like the task result is currently loaded
into memory regardless of whether it is a DirectTaskResult or an IndirectTaskResult.
Previously I thought an IndirectTaskResult could be loaded into memory later, which
would save memory; RDD#collectAsIterator is what I had in mind as a way to save
memory.
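
As an aside (this is an existing API, not the change discussed here), RDD#toLocalIterator
fetches results one partition at a time, so the driver only holds a single partition in
memory at once:

val rdd = sc.parallelize(1 to 1000000, 100)
val it  = rdd.toLocalIterator      // triggers one job per partition as you iterate
it.take(5).foreach(println)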

On Tue, Mar 1, 2016 at 5:00 PM, Reynold Xin <r...@databricks.com> wrote:

> How big of a deal is this though? If I am reading your email correctly,
> either way this job will fail. You simply want it to fail earlier in the
> executor side, rather than collecting it and fail on the driver side?
>
>
> On Sunday, February 28, 2016, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> data skew might be possible, but not the common case. I think we should
>> design for the common case, for the skew case, we may can set some
>> parameter of fraction to allow user to tune it.
>>
>> On Sat, Feb 27, 2016 at 4:51 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> But sometimes you might have skew and almost all the result data are in
>>> one or a few tasks though.
>>>
>>>
>>> On Friday, February 26, 2016, Jeff Zhang <zjf...@gmail.com> wrote:
>>>
>>>>
>>>> My job get this exception very easily even when I set large value of
>>>> spark.driver.maxResultSize. After checking the spark code, I found
>>>> spark.driver.maxResultSize is also used in Executor side to decide whether
>>>> DirectTaskResult/InDirectTaskResult sent. This doesn't make sense to me.
>>>> Using  spark.driver.maxResultSize / taskNum might be more proper. Because
>>>> if  spark.driver.maxResultSize is 1g and we have 10 tasks each has 200m
>>>> output. Then even the output of each task is less than
>>>>  spark.driver.maxResultSize so DirectTaskResult will be sent to driver, but
>>>> the total result size is 2g which will cause exception in driver side.
>>>>
>>>>
>>>> 16/02/26 10:10:49 INFO DAGScheduler: Job 4 failed: treeAggregate at
>>>> LogisticRegression.scala:283, took 33.796379 s
>>>>
>>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
>>>> due to stage failure: Total size of serialized results of 1 tasks (1085.0
>>>> MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
>>>>
>>>>
>>>> --
>>>> Best Regards
>>>>
>>>> Jeff Zhang
>>>>
>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>


-- 
Best Regards

Jeff Zhang


Re: Converting array to DF

2016-03-01 Thread Jeff Zhang
Change Array to Seq and import sqlContext.implicits._
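
For example, with the data from the question:

import sqlContext.implicits._

val weights = Seq(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6))
val df = weights.toDF("weights", "value")
df.orderBy($"value").show()   // contents in value order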



On Tue, Mar 1, 2016 at 4:38 PM, Ashok Kumar <ashok34...@yahoo.com.invalid>
wrote:

> Hi,
>
> I have this
>
> val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9),
> ("f", 4), ("g", 6))
> weights.toDF("weights","value")
>
> I want to convert the Array to DF but I get thisor
>
> weights: Array[(String, Int)] = Array((a,3), (b,2), (c,5), (d,1), (e,9),
> (f,4), (g,6))
> :33: error: value toDF is not a member of Array[(String, Int)]
>   weights.toDF("weights","value")
>
> I want to label columns and print out the contents in value order please I
> don't know why I am getting this error
>
> Thanks
>
>


-- 
Best Regards

Jeff Zhang


Re: Support virtualenv in PySpark

2016-03-01 Thread Jeff Zhang
I may not have expressed it clearly. This approach creates the virtualenv
before the python worker starts, and the virtualenv is application-scoped: after
the spark application finishes, the virtualenv is cleaned up. The
virtualenvs also don't need to be at the same path on each node (in my POC, it is
the yarn container working directory). That means users don't need to
manually install packages on each node (sometimes you can't even install
packages on the cluster due to security reasons). This is the biggest benefit
and the main purpose: a user can create a virtualenv on demand without touching each
node, even when they are not an administrator. The downside is the extra cost of
installing the required packages before starting the python worker. But if it
is an application that will run for several hours, then the extra cost can
be ignored.

On Tue, Mar 1, 2016 at 4:15 PM, Mohannad Ali <man...@gmail.com> wrote:

> Hello Jeff,
>
> Well this would also mean that you have to manage the same virtualenv
> (same path) on all nodes and install your packages to it the same way you
> would if you would install the packages to the default python path.
>
> In any case at the moment you can already do what you proposed by creating
> identical virtualenvs on all nodes on the same path and change the spark
> python path to point to the virtualenv.
>
> Best Regards,
> Mohannad
> On Mar 1, 2016 06:07, "Jeff Zhang" <zjf...@gmail.com> wrote:
>
>> I have created jira for this feature , comments and feedback are welcome
>> about how to improve it and whether it's valuable for users.
>>
>> https://issues.apache.org/jira/browse/SPARK-13587
>>
>>
>> Here's some background info and status of this work.
>>
>>
>> Currently, it's not easy for user to add third party python packages in
>> pyspark.
>>
>>- One way is to using --py-files (suitable for simple dependency, but
>>not suitable for complicated dependency, especially with transitive
>>dependency)
>>- Another way is install packages manually on each node (time
>>wasting, and not easy to switch to different environment)
>>
>> Python now has 2 different virtualenv implementation. One is native
>> virtualenv another is through conda.
>>
>> I have implemented POC for this features. Here's one simple command for
>> how to use virtualenv in pyspark
>>
>> bin/spark-submit --master yarn --deploy-mode client --conf 
>> "spark.pyspark.virtualenv.enabled=true" --conf 
>> "spark.pyspark.virtualenv.type=conda" --conf 
>> "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt"
>>  --conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
>> ~/work/virtualenv/spark.py
>>
>> There're 4 properties needs to be set
>>
>>- spark.pyspark.virtualenv.enabled (enable virtualenv)
>>    - spark.pyspark.virtualenv.type (native/conda are supported, default
>>is native)
>>- spark.pyspark.virtualenv.requirements (requirement file for the
>>dependencies)
>>- spark.pyspark.virtualenv.path (path to the executable for for
>>virtualenv/conda)
>>
>>
>>
>>
>>
>>
>> Best Regards
>>
>> Jeff Zhang
>>
>


-- 
Best Regards

Jeff Zhang


Re: Save DataFrame to Hive Table

2016-02-29 Thread Jeff Zhang
.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang


Support virtualenv in PySpark

2016-02-29 Thread Jeff Zhang
I have created a JIRA for this feature; comments and feedback are welcome
on how to improve it and whether it's valuable for users.

https://issues.apache.org/jira/browse/SPARK-13587


Here's some background info and status of this work.


Currently, it's not easy for users to add third-party Python packages in
pyspark.

   - One way is to use --py-files (suitable for simple dependencies, but
   not for complicated ones, especially those with transitive dependencies)
   - Another way is to install packages manually on each node (time
   consuming, and not easy to switch between environments)

Python now has two different virtualenv implementations: the native
virtualenv, and conda.

I have implemented a POC for this feature. Here's a simple command showing
how to use virtualenv in pyspark:

bin/spark-submit --master yarn --deploy-mode client --conf
"spark.pyspark.virtualenv.enabled=true" --conf
"spark.pyspark.virtualenv.type=conda" --conf
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt"
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"
 ~/work/virtualenv/spark.py

There are 4 properties that need to be set:

   - spark.pyspark.virtualenv.enabled (enable virtualenv)
   - spark.pyspark.virtualenv.type (native/conda are supported, default is
   native)
   - spark.pyspark.virtualenv.requirements (requirement file for the
   dependencies)
   - spark.pyspark.virtualenv.path (path to the executable for
   virtualenv/conda)






Best Regards

Jeff Zhang


Re: Is spark.driver.maxResultSize used correctly ?

2016-02-28 Thread Jeff Zhang
Data skew is possible, but it's not the common case. I think we should
design for the common case; for the skew case, we could add a fraction
parameter that lets users tune it.
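
To make the idea concrete, here is a rough sketch of the executor-side
check I have in mind (the names shouldSendDirectResult, taskNum and
fraction are placeholders for illustration, not existing Spark code or
configuration):

// hypothetical per-task limit: a tunable fraction of the driver-side cap
def shouldSendDirectResult(resultSize: Long,
                           maxResultSize: Long,
                           taskNum: Int,
                           fraction: Double = 1.0): Boolean = {
  // each task may return at most (maxResultSize / taskNum) * fraction bytes;
  // a larger fraction tolerates more skew, a smaller one is more conservative
  val perTaskLimit = (maxResultSize.toDouble / taskNum * fraction).toLong
  resultSize <= perTaskLimit
}

// e.g. with a 1g cap, 10 tasks and 200m per task, the check fails and the
// result would go through the block manager instead of straight to the driver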

On Sat, Feb 27, 2016 at 4:51 PM, Reynold Xin <r...@databricks.com> wrote:

> But sometimes you might have skew and almost all the result data are in
> one or a few tasks though.
>
>
> On Friday, February 26, 2016, Jeff Zhang <zjf...@gmail.com> wrote:
>
>>
>> My job get this exception very easily even when I set large value of
>> spark.driver.maxResultSize. After checking the spark code, I found
>> spark.driver.maxResultSize is also used in Executor side to decide whether
>> DirectTaskResult/InDirectTaskResult sent. This doesn't make sense to me.
>> Using  spark.driver.maxResultSize / taskNum might be more proper. Because
>> if  spark.driver.maxResultSize is 1g and we have 10 tasks each has 200m
>> output. Then even the output of each task is less than
>>  spark.driver.maxResultSize so DirectTaskResult will be sent to driver, but
>> the total result size is 2g which will cause exception in driver side.
>>
>>
>> 16/02/26 10:10:49 INFO DAGScheduler: Job 4 failed: treeAggregate at
>> LogisticRegression.scala:283, took 33.796379 s
>>
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
>> due to stage failure: Total size of serialized results of 1 tasks (1085.0
>> MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>


-- 
Best Regards

Jeff Zhang


Re: PySpark : couldn't pickle object of type class T

2016-02-28 Thread Jeff Zhang
Hi Anoop,

I don't see the exception you mentioned in the link. I can use spark-avro
to read the sample file users.avro in spark successfully. Do you have the
details of the union issue ?



On Sat, Feb 27, 2016 at 10:05 AM, Anoop Shiralige <anoop.shiral...@gmail.com
> wrote:

> Hi Jeff,
>
> Thank you for looking into the post.
>
> I had explored spark-avro option earlier. Since, we have union of multiple
> complex data types in our avro schema we couldn't use it.
> Couple of things I tried.
>
>-
>
> https://stackoverflow.com/questions/31261376/how-to-read-pyspark-avro-file-and-extract-the-values
>  :
>"Spark Exception : Unions may only consist of concrete type and null"
>- Use of dataFrame/DataSet : serialization problem.
>
> For now, I got it working by modifing AvroConversionUtils, to address the
> union of multiple data-types.
>
> Thanks,
> AnoopShiralige
>
>
> On Thu, Feb 25, 2016 at 7:25 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> Avro Record is not supported by pickler, you need to create a custom
>> pickler for it.  But I don't think it worth to do that. Actually you can
>> use package spark-avro to load avro data and then convert it to RDD if
>> necessary.
>>
>> https://github.com/databricks/spark-avro
>>
>>
>> On Thu, Feb 11, 2016 at 10:38 PM, Anoop Shiralige <
>> anoop.shiral...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I am working with Spark 1.6.0 and pySpark shell specifically. I have an
>>> JavaRDD[org.apache.avro.GenericRecord] which I have converted to
>>> pythonRDD
>>> in the following way.
>>>
>>> javaRDD = sc._jvm.java.package.loadJson("path to data", sc._jsc)
>>> javaPython = sc._jvm.SerDe.javaToPython(javaRDD)
>>> from pyspark.rdd import RDD
>>> pythonRDD=RDD(javaPython,sc)
>>>
>>> pythonRDD.first()
>>>
>>> However everytime I am trying to call collect() or first() method on
>>> pythonRDD I am getting the following error :
>>>
>>> 16/02/11 06:19:19 ERROR python.PythonRunner: Python worker exited
>>> unexpectedly (crashed)
>>> org.apache.spark.api.python.PythonException: Traceback (most recent call
>>> last):
>>>   File
>>>
>>> "/disk2/spark6/spark-1.6.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/worker.py",
>>> line 98, in main
>>> command = pickleSer._read_with_length(infile)
>>>   File
>>>
>>> "/disk2/spark6/spark-1.6.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/serializers.py",
>>> line 156, in _read_with_length
>>> length = read_int(stream)
>>>   File
>>>
>>> "/disk2/spark6/spark-1.6.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/serializers.py",
>>> line 545, in read_int
>>> raise EOFError
>>> EOFError
>>>
>>> at
>>>
>>> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>>> at
>>>
>>> org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
>>> at
>>> org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>>> at
>>> org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
>>> at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>> at
>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>> at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>> at
>>>
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at
>>>
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:744)
>>> Caused by: net.razorvine.pickle.PickleException: couldn't pickle object
>>> of
>>> type class org.apache.avro.generic.GenericData$Record
>>> at net.razorvine.pickle.Pickler.save(Pickler.java:142)
>>> at
>>> net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:493)
>>> at net.razorvine.pickle.Pickler.dispatch(Pickler.java:205)
>>> at net.razorvine.pickle.Pickler.save(Pickler.java:137)
>>> at net.razorvine.pickle.Pickler.dump(Pickler.java:107)
>>> 

Re: Dynamic allocation Spark

2016-02-26 Thread Jeff Zhang
Check the RM UI to ensure you have available resources. I suspect yarn may
not be configured correctly, so the NodeManager didn't start properly and
you have no resources available.

On Fri, Feb 26, 2016 at 7:14 PM, alvarobrandon <alvarobran...@gmail.com>
wrote:

> Hello everyone:
>
> I'm trying the dynamic allocation in Spark with YARN. I have followed the
> following configuration steps:
> 1. Copy the spark-*-yarn-shuffle.jar to the nodemanager classpath. "cp
> /opt/spark/lib/spark-*-yarn-shuffle.jar /opt/hadoop/share/hadoop/yarn"
> 2. Added the shuffle service of spark in yarn-site.xml
> 
> yarn.nodemanager.aux-services
> mapreduce_shuffle,spark_shuffle
> shuffle implementation
>   
> 3. Enabled the class for the shuffle service in yarn-site.xml
>   
> yarn.nodemanager.aux-services.spark_shuffle.class
> org.apache.spark.network.yarn.YarnShuffleService
> enable the class for dynamic allocation
>   
> 4. Activated the dynamic allocation in the spark defaults
> spark.dynamicAllocation.enabled true
> spark.shuffle.service.enabled   true
>
> When I launch my application it just stays in the queue accepted but it
> never actually runs.
> 16/02/26 11:11:46 INFO yarn.Client: Application report for
> application_1456482268159_0001 (state: ACCEPTED)
>
> Am I missing something?
>
> Thanks in advance as always
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Dynamic-allocation-Spark-tp26344.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang


Re: No event log in /tmp/spark-events

2016-02-26 Thread Jeff Zhang
If event logging is enabled, there should be a log line like the following,
but I don't see it in your log:

16/02/26 19:10:01 INFO EventLoggingListener: Logging events to
file:///Users/jzhang/Temp/spark-events/local-1456485001491

Could you add "--verbose" to the spark-submit command to check whether your
configuration is picked up correctly?
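
For reference, a minimal sketch of the two settings that have to be picked
up for event logging (shown via SparkConf; the equivalent entries in
spark-defaults.conf work the same way, and the directory path here is just
an example that must already exist):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("event-log-example")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "file:///tmp/spark-events")

val sc = new SparkContext(conf)
sc.parallelize(1 to 10).count()
sc.stop()  // the event log is finalized when the context stops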

On Fri, Feb 26, 2016 at 7:01 PM, alvarobrandon <alvarobran...@gmail.com>
wrote:

> Just write /tmp/sparkserverlog without the file part.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/No-event-log-in-tmp-spark-events-tp26318p26343.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang


Is spark.driver.maxResultSize used correctly ?

2016-02-26 Thread Jeff Zhang
My job gets this exception very easily, even when I set a large value for
spark.driver.maxResultSize. After checking the spark code, I found that
spark.driver.maxResultSize is also used on the executor side to decide
whether a DirectTaskResult or an IndirectTaskResult is sent. This doesn't
make sense to me; using spark.driver.maxResultSize / taskNum might be more
appropriate. For example, if spark.driver.maxResultSize is 1g and we have
10 tasks, each with 200m of output, then the output of each task is less
than spark.driver.maxResultSize, so a DirectTaskResult is sent to the
driver, but the total result size is 2g, which causes an exception on the
driver side.


16/02/26 10:10:49 INFO DAGScheduler: Job 4 failed: treeAggregate at
LogisticRegression.scala:283, took 33.796379 s

Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Total size of serialized results of 1 tasks (1085.0 MB)
is bigger than spark.driver.maxResultSize (1024.0 MB)
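
For reference, a small sketch that reproduces the scenario described above
(sizes are approximate, and the driver needs enough heap to hit the result
size check rather than an OOM):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("max-result-size-demo")
  .set("spark.driver.maxResultSize", "1g")
val sc = new SparkContext(conf)

// 10 partitions, each task returning roughly 200 MB
val big = sc.parallelize(1 to 10, 10)
  .map(_ => new Array[Byte](200 * 1024 * 1024))

// every individual task result is below the 1g cap, yet the ~2 GB total
// trips the driver-side check and aborts the job
big.collect()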


-- 
Best Regards

Jeff Zhang


Re: When I merge some datas,can't go on...

2016-02-26 Thread Jeff Zhang
rdd.map(e=>e.split("\\s")).map(e=>(e(0),e(1))).groupByKey()
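
A slightly fuller sketch of the same idea (spark-shell; collecting back to
a local Map is only sensible if the grouped data is small):

val file1 = sc.textFile("1.txt")

// split each line into a (key, value) pair instead of flattening everything
val pairs = file1.map(_.split("\\s+")).map(e => (e(0).toInt, e(1).toInt))

// group the values per key: RDD[(Int, Iterable[Int])]
val grouped = pairs.groupByKey()

// e.g. Map(1 -> List(2, 3, 4, 5, 6, 7), 2 -> List(4, 5, 7, 9))
val result = grouped.mapValues(_.toList).collectAsMap()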

On Fri, Feb 26, 2016 at 3:20 PM, Bonsen <hengbohe...@126.com> wrote:

> I have a file,like 1.txt:
> 1 2
> 1 3
> 1 4
> 1 5
> 1 6
> 1 7
> 2 4
> 2 5
> 2 7
> 2 9
>
> I want to merge them,results like this
> map(1->List(2,3,4,5,6,7),2->List(4,5,7,9))
> what should I do?。。
> val file1=sc.textFile("1.txt")
> val q1=file1.flatMap(_.split(' '))???,maybe I should change RDD[int] to
> RDD[int,int]?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/When-I-merge-some-datas-can-t-go-on-tp26341.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang


Re: PySpark : couldn't pickle object of type class T

2016-02-24 Thread Jeff Zhang
Avro Record is not supported by the pickler; you would need to create a
custom pickler for it. But I don't think it's worth doing that. Instead,
you can use the spark-avro package to load the Avro data and then convert
it to an RDD if necessary.

https://github.com/databricks/spark-avro


On Thu, Feb 11, 2016 at 10:38 PM, Anoop Shiralige <anoop.shiral...@gmail.com
> wrote:

> Hi All,
>
> I am working with Spark 1.6.0 and pySpark shell specifically. I have an
> JavaRDD[org.apache.avro.GenericRecord] which I have converted to pythonRDD
> in the following way.
>
> javaRDD = sc._jvm.java.package.loadJson("path to data", sc._jsc)
> javaPython = sc._jvm.SerDe.javaToPython(javaRDD)
> from pyspark.rdd import RDD
> pythonRDD=RDD(javaPython,sc)
>
> pythonRDD.first()
>
> However everytime I am trying to call collect() or first() method on
> pythonRDD I am getting the following error :
>
> 16/02/11 06:19:19 ERROR python.PythonRunner: Python worker exited
> unexpectedly (crashed)
> org.apache.spark.api.python.PythonException: Traceback (most recent call
> last):
>   File
>
> "/disk2/spark6/spark-1.6.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/worker.py",
> line 98, in main
> command = pickleSer._read_with_length(infile)
>   File
>
> "/disk2/spark6/spark-1.6.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/serializers.py",
> line 156, in _read_with_length
> length = read_int(stream)
>   File
>
> "/disk2/spark6/spark-1.6.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/serializers.py",
> line 545, in read_int
> raise EOFError
> EOFError
>
> at
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
> at
>
> org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
> at
> org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
> at
> org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: net.razorvine.pickle.PickleException: couldn't pickle object of
> type class org.apache.avro.generic.GenericData$Record
> at net.razorvine.pickle.Pickler.save(Pickler.java:142)
> at
> net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:493)
> at net.razorvine.pickle.Pickler.dispatch(Pickler.java:205)
> at net.razorvine.pickle.Pickler.save(Pickler.java:137)
> at net.razorvine.pickle.Pickler.dump(Pickler.java:107)
> at net.razorvine.pickle.Pickler.dumps(Pickler.java:92)
> at
>
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:121)
> at
>
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:110)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at
>
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:110)
> at
>
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
> at
>
> org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
> at
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
> at
>
> org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
>
> Thanks for your time,
> AnoopShiralige
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-couldn-t-pickle-object-of-type-class-T-tp26204.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang


Re: which master option to view current running job in Spark UI

2016-02-23 Thread Jeff Zhang
Viewing running jobs in the Spark UI doesn't depend on which master you use.
What do you mean by "I can't see the currently running jobs in the Spark web
UI"? Do you see a blank Spark UI, or can you not open the Spark UI at all?

On Mon, Feb 15, 2016 at 12:55 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> When running in YARN, you can use the YARN Resource Manager UI to get to
> the ApplicationMaster url, irrespective of client or cluster mode.
>
> Regards
> Sab
> On 15-Feb-2016 10:10 am, "Divya Gehlot" <divya.htco...@gmail.com> wrote:
>
>> Hi,
>> I have Hortonworks 2.3.4 cluster on EC2 and Have spark jobs as scala
>> files .
>> I am bit confused between using *master  *options
>> I want to execute this spark job in YARN
>>
>> Curently running as
>> spark-shell --properties-file  /TestDivya/Spark/Oracle.properties --jars
>> /usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar --driver-class-path
>> /usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar --packages
>> com.databricks:spark-csv_2.10:1.1.0  *--master yarn-client *  -i
>> /TestDivya/Spark/Test.scala
>>
>> with this option I cant see the currently running jobs in Spark WEB UI
>> though it later appear in spark history server.
>>
>> My question with which --master option should I run my spark jobs so that
>> I can view the currently running jobs in spark web UI .
>>
>> Thanks,
>> Divya
>>
>


-- 
Best Regards

Jeff Zhang


Re: Joining three tables with data frames

2016-02-13 Thread Jeff Zhang
What do you mean by "does not work"? What's the error message? BTW, wouldn't
it be simpler to register the 3 data frames as temporary tables and then use
the same SQL query you used before in Hive and Oracle?

On Sun, Feb 14, 2016 at 9:28 AM, Mich Talebzadeh <m...@peridale.co.uk>
wrote:

> Hi,
>
>
>
> I have created DFs on three Oracle tables.
>
>
>
> The join in Hive and Oracle are pretty simple
>
>
>
> SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS
> TotalSales
>
> FROM sales s, times t, channels c
>
> WHERE s.time_id = t.time_id
>
> AND   s.channel_id = c.channel_id
>
> GROUP BY t.calendar_month_desc, c.channel_desc
>
> ;
>
>
>
> I try to do this using Data Framess
>
>
>
>
>
> import org.apache.spark.sql.functions._
>
>
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>
> //
>
> val s = sqlContext.load("jdbc",
>
> Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>
> "dbtable" -> "(sh.sales)",
>
> "user" -> "sh",
>
> "password" -> "x"))
>
> //
>
> val c = sqlContext.load("jdbc",
>
> Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>
> "dbtable" -> "(sh.channels)",
>
> "user" -> "sh",
>
> "password" -> "x"))
>
>
>
> val t = sqlContext.load("jdbc",
>
> Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>
> "dbtable" -> "(sh.times)",
>
> "user" -> "sh",
>
> "password" -> "x"))
>
> //
>
> val sc = s.join(c, s.col("CHANNEL_ID") === c.col("CHANNEL_ID"))
>
> val st = s.join(t, s.col("TIME_ID") === t.col("TIME_ID"))
>
>
>
> val rs = sc.join(st)
>
>
>
> rs.groupBy($"calendar_month_desc",$"channel_desc").agg(sum($"amount_sold"))
>
>
>
> The las result set (rs) does not work.
>
>
>
> Since data is imported then I assume that the columns for joins need to be
> defined in data frame for each table rather than importing the whole
> columns.
>
>
>
> Thanks,
>
>
>
>
>
> Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>
>
>



-- 
Best Regards

Jeff Zhang


Re: corresponding sql for query against LocalRelation

2016-01-27 Thread Jeff Zhang
I think LocalRelation is used for DataFrames created from local data, for example:

>> val df = sqlContext.createDataFrame(Seq((1,"jeff"),(2, "andy")))
>> df.explain(true)

== Parsed Logical Plan ==
LocalRelation [_1#0,_2#1], [[1,jeff],[2,andy]]

== Analyzed Logical Plan ==
_1: int, _2: string
LocalRelation [_1#0,_2#1], [[1,jeff],[2,andy]]

== Optimized Logical Plan ==
LocalRelation [_1#0,_2#1], [[1,jeff],[2,andy]]

== Physical Plan ==
LocalTableScan [_1#0,_2#1], [[1,jeff],[2,andy]]

On Thu, Jan 28, 2016 at 10:18 AM, ey-chih chow <eyc...@hotmail.com> wrote:

> Hi,
>
> For a query against the LocalRelation, is there anybody know what does the
> corresponding SQL looks like?  Thanks.
>
> Best regards,
>
> Ey-Chih Chow
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/corresponding-sql-for-query-against-LocalRelation-tp26093.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang


Re: How to discretize Continuous Variable with Spark DataFrames

2016-01-25 Thread Jeff Zhang
It's very straightforward; please refer to the documentation here:
http://spark.apache.org/docs/latest/ml-features.html#bucketizer
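
For example, a minimal sketch in spark-shell (the split points are supplied
by the user, which is how the bucket boundaries are decided):

import org.apache.spark.ml.feature.Bucketizer

// buckets: (-inf, 0), [0, 10), [10, 100), [100, +inf)
val splits = Array(Double.NegativeInfinity, 0.0, 10.0, 100.0,
  Double.PositiveInfinity)

val df = sqlContext.createDataFrame(
  Seq(-5.0, 1.0, 7.5, 42.0, 250.0).map(Tuple1.apply)).toDF("some_column")

val bucketizer = new Bucketizer()
  .setInputCol("some_column")
  .setOutputCol("bucket")
  .setSplits(splits)

// each value is replaced by the index of the bucket it falls into
bucketizer.transform(df).show()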


On Mon, Jan 25, 2016 at 10:09 PM, Eli Super <eli.su...@gmail.com> wrote:

> Thanks Joshua ,
>
> I can't understand what algorithm behind Bucketizer  , how discretization
> done ?
>
> Best Regards
>
>
> On Mon, Jan 25, 2016 at 3:40 PM, Joshua TAYLOR <joshuaaa...@gmail.com>
> wrote:
>
>> It sounds like you may want the Bucketizer in SparkML.  The overview docs
>> [1] include, "Bucketizer transforms a column of continuous features to a
>> column of feature buckets, where the buckets are specified by users."
>>
>> [1]: http://spark.apache.org/docs/latest/ml-features.html#bucketizer
>>
>> On Mon, Jan 25, 2016 at 5:34 AM, Eli Super <eli.su...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> What is a best way to discretize Continuous Variable within  Spark
>>> DataFrames ?
>>>
>>> I want to discretize some variable 1) by equal frequency 2) by k-means
>>>
>>> I usually use R  for this porpoises
>>>
>>> _http://www.inside-r.org/packages/cran/arules/docs/discretize
>>>
>>> R code for example :
>>>
>>> ### equal frequency
>>> table(discretize(data$some_column, "frequency", categories=10))
>>>
>>>
>>> #k-means
>>> table(discretize(data$some_column, "cluster", categories=10))
>>>
>>> Thanks a lot !
>>>
>>
>>
>>
>> --
>> Joshua Taylor, http://www.cs.rpi.edu/~tayloj/
>>
>
>


-- 
Best Regards

Jeff Zhang


Re: SparkR with Hive integration

2016-01-18 Thread Jeff Zhang
Please make sure you export the environment variable HADOOP_CONF_DIR,
pointing to the directory that contains core-site.xml.

On Mon, Jan 18, 2016 at 8:23 PM, Peter Zhang <zhangju...@gmail.com> wrote:

> Hi all,
>
> http://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes
> From Hive tables
> <http://spark.apache.org/docs/latest/sparkr.html#from-hive-tables>
>
> You can also create SparkR DataFrames from Hive tables. To do this we will
> need to create a HiveContext which can access tables in the Hive MetaStore.
> Note that Spark should have been built with Hive support
> <http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support>
>  and
> more details on the difference between SQLContext and HiveContext can be
> found in the SQL programming guide
> <http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext>
> .
>
> # sc is an existing SparkContext.
> hiveContext <- sparkRHive.init(sc)
>
> sql(hiveContext, "CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> sql(hiveContext, "LOAD DATA LOCAL INPATH 
> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
> # Queries can be expressed in HiveQL.
> results <- sql(hiveContext, "FROM src SELECT key, value")
> # results is now a DataFrame
> head(results)
> ##  key   value
> ## 1 238 val_238
> ## 2  86  val_86
> ## 3 311 val_311
>
>
> I use RStudio to run above command, when I run "sql(hiveContext, "CREATE
> TABLE IF NOT EXISTS src (key INT, value STRING)”)”
>
> I got exception: 16/01/19 12:11:51 INFO FileUtils: Creating directory if
> it doesn't exist: file:/user/hive/warehouse/src 16/01/19 12:11:51 ERROR
> DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException:
> MetaException(message:file:/user/hive/warehouse/src is not a directory or
> unable to create one)
>
> How  to use HDFS instead of local file system(file)?
> Which parameter should to set?
>
> Thanks a lot.
>
>
> Peter Zhang
> --
> Google
> Sent with Airmail
>



-- 
Best Regards

Jeff Zhang


Re: yarn-client: SparkSubmitDriverBootstrapper not found in yarn client mode (1.6.0)

2016-01-13 Thread Jeff Zhang
I also didn't find SparkSubmitDriverBootstrapper. Which version of Spark are
you using?

On Wed, Jan 13, 2016 at 9:36 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Can you show the complete stack trace for the error ?
>
> I searched 1.6.0 code base but didn't find the
> class SparkSubmitDriverBootstrapper
>
> Thanks
>
> On Wed, Jan 13, 2016 at 9:31 AM, Lin Zhao <l...@exabeam.com> wrote:
>
>> My job runs fine in yarn cluster mode but I have reason to use client
>> mode instead. But I'm hitting this error when submitting:
>> > spark-submit --class com.exabeam.martini.scripts.SparkStreamingTest
>> --master yarn --deploy-mode client --executor-memory 90G --num-executors 3
>> --executor-cores 14 Martini-assembly-0.1.jar yarn-client
>>
>> Error: Could not find or load main class
>> org.apache.spark.deploy.SparkSubmitDriverBootstrapper
>>
>>  If I replace deploy-mode to cluster the job is submitted successfully.
>> Is there a dependency missing from my project? Right now only one I
>> included is spark-streaming 1.6.0.
>>
>
>


-- 
Best Regards

Jeff Zhang

