Re: Choice of IDE for Spark

2021-09-30 Thread Jeff Zhang

Re: Unable to use scala function in pyspark

2021-09-26 Thread Jeff Zhang
You can first try it via docker: http://zeppelin.apache.org/download.html#using-the-official-docker-image Jeff Zhang wrote on Mon, Sep 27, 2021 at 6:49 AM: > Hi kumar, > > You can try Zeppelin, which supports UDF sharing across languages > > http://zeppelin.apache.org/

Re: Unable to use scala function in pyspark

2021-09-26 Thread Jeff Zhang
2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py > in (.0) >1264 >1265 args_command = "".join( > -> 1266 [get_command_part(arg, self.pool) for arg in new_args]) >1267 >1268 return args_command, temp_args > > ~/.sdkman/candidates/spark/3.0.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py > in get_command_part(parameter, python_proxy_pool) > 296 command_part += ";" + interface > 297 else: > --> 298 command_part = REFERENCE_TYPE + parameter._get_object_id() > 299 > 300 command_part += "\n" > > > > > > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards Jeff Zhang

[ANNOUNCE] Apache Zeppelin 0.10.0 is released, Spark on Zeppelin Improved

2021-08-26 Thread Jeff Zhang
/interpreter/spark.html Download it here: https://zeppelin.apache.org/download.html -- Best Regards Jeff Zhang Twitter: zjffdu

Re: Databricks notebook - cluster taking a long time to get created, often timing out

2021-08-17 Thread Jeff Zhang
t created, often timing out. > > Any ideas on how to resolve this ? > Any other alternatives to databricks notebook ? > > -- Best Regards Jeff Zhang

Is the pandas version in doc of using pyarrow in spark wrong

2021-08-09 Thread Jeff Zhang
/api/python/user_guide/arrow_pandas.html#recommended-pandas-and-pyarrow-versions -- Best Regards Jeff Zhang

Re: Fail to use SparkR of 3.0 preview 2

2019-12-26 Thread Jeff Zhang
c 2018) > > -- > *From:* Jeff Zhang > *Sent:* Thursday, December 26, 2019 5:36:50 PM > *To:* Felix Cheung > *Cc:* user.spark > *Subject:* Re: Fail to use SparkR of 3.0 preview 2 > > I use R 3.5.2 > > Felix Cheung 于2019年12月27日周五 上午4:32写道: > > I

Re: Fail to use SparkR of 3.0 preview 2

2019-12-26 Thread Jeff Zhang
I use R 3.5.2 Felix Cheung 于2019年12月27日周五 上午4:32写道: > It looks like a change in the method signature in R base packages. > > Which version of R are you running on? > > -- > *From:* Jeff Zhang > *Sent:* Thursday, December 26, 2019 12:46:12

Fail to use SparkR of 3.0 preview 2

2019-12-26 Thread Jeff Zhang
ods")): number of columns of matrices must match (see arg 2) During startup - Warning messages: 1: package ‘SparkR’ was built under R version 3.6.2 2: package ‘SparkR’ in options("defaultPackages") was not found Does anyone know what might be wrong ? Thanks -- Best Regards Jeff Zhang

Re: Spark job fails because of timeout to Driver

2019-10-04 Thread Jeff Zhang
g, but there must be > something wrong with my setup. I don't understand the code of the > ApplicationMaster, so could somebody explain me what it is trying to reach? > Where exactly does the connection timeout? So at least I can debug it > further because I don't have a clue what it is doing :-) > > Thanks for any help! > Jochen > -- Best Regards Jeff Zhang

Re: [spark on yarn] spark on yarn without DFS

2019-05-19 Thread Jeff Zhang
rn cluster mode. Could I use yarn > without starting DFS, and how could I use this mode? > > Yours, > Jane > -- Best Regards Jeff Zhang

Re: Best notebook for developing for apache spark using scala on Amazon EMR Cluster

2019-05-01 Thread Jeff Zhang

Re: [ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-15 Thread Jeff Zhang
bers for contributing to > this release. This release would not have been possible without you. > > Bests, > Dongjoon. > -- Best Regards Jeff Zhang

Re: Run/install tensorframes on zeppelin pyspark

2018-08-08 Thread Jeff Zhang
Make sure you use the correct python that has tensorframes installed, and use PYSPARK_PYTHON to point to it. Spico Florin wrote on Wed, Aug 8, 2018 at 9:59 PM: > Hi! > > I would like to use tensorframes in my pyspark notebook. > > I have performed the following: > > 1. In the spark interpreter added a new
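
A minimal sketch of the suggestion above for a driver launched from a plain Python process (the interpreter path is an assumption; in Zeppelin, PYSPARK_PYTHON is normally set as an interpreter property instead):

    import os

    # Point the executors at the Python environment that actually has
    # tensorframes installed (hypothetical path -- adjust to your setup).
    os.environ["PYSPARK_PYTHON"] = "/opt/envs/tensorframes/bin/python"

    from pyspark.sql import SparkSession

    # PYSPARK_PYTHON is picked up when the python workers are launched, so set
    # it before the SparkSession (and its SparkContext) is created.
    spark = SparkSession.builder.appName("tensorframes-env-check").getOrCreate()
    print(spark.sparkContext.pythonExec)  # which python the workers will use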

Re: Spark YARN Error - triggering spark-shell

2018-06-08 Thread Jeff Zhang
Check the yarn AM log for details. Aakash Basu wrote on Fri, Jun 8, 2018 at 4:36 PM: > Hi, > > Getting this error when trying to run Spark Shell using YARN - > > Command: *spark-shell --master yarn --deploy-mode client* > > 2018-06-08 13:39:09 WARN Client:66 - Neither spark.yarn.jars nor >

Re: Livy Failed error on Yarn with Spark

2018-05-24 Thread Jeff Zhang
Could you check the spark app's yarn log and the livy log? Chetan Khatri wrote on Thu, May 10, 2018 at 4:18 AM: > All, > > I am running on Hortonworks HDP Hadoop with Livy and Spark 2.2.0. When I > am running the same spark job using spark-submit it succeeds with all >

Re: [Spark] Supporting python 3.5?

2018-05-24 Thread Jeff Zhang
It supports python 3.5, and IIRC, spark also supports python 3.6. Irving Duran wrote on Thu, May 10, 2018 at 9:08 PM: > Does spark now support python 3.5 or is it just 3.4.x? > > https://spark.apache.org/docs/latest/rdd-programming-guide.html > > Thank You, > > Irving Duran >

Re: Spark on YARN in client-mode: do we need 1 vCore for the AM?

2018-05-24 Thread Jeff Zhang
I don't think it is possible to have less than 1 core for the AM; this is due to yarn, not spark. The number of AMs compared to the number of executors should be small and acceptable. If you do want to save more resources, I would suggest using yarn cluster mode, where the driver and AM run in the

Re: spark-submit can find python?

2018-01-15 Thread Jeff Zhang
Hi Manuel, Looks like you are using the virtualenv support of spark. Virtualenv will create the python environment on each executor. >>> --conf >>> spark.pyspark.virtualenv.bin.path=/home/mansop/hail-test/python-2.7.2/bin/activate \ And the configuration is not right: spark.pyspark.virtualenv.bin.path

Re: PIG to Spark

2018-01-08 Thread Jeff Zhang
Pig supports the spark engine now, so you can leverage spark execution with a pig script. I am afraid there's no solution for converting a pig script to spark api code. Pralabh Kumar wrote on Mon, Jan 8, 2018 at 11:25 PM: > Hi > > Is there a convenient way / open source project to convert PIG

Re: pyspark configuration with Juyter

2017-11-03 Thread Jeff Zhang
You are setting PYSPARK_DRIVER_PYTHON to jupyter; please set it to the python executable instead. anudeep wrote on Fri, Nov 3, 2017 at 7:31 PM: > Hello experts, > > I installed jupyter notebook through anaconda and set the pyspark driver to use > jupyter notebook. > > I see the below issue when i try to open

Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Jeff Zhang
Awesome! Hyukjin Kwon wrote on Thu, Jul 13, 2017 at 8:48 AM: > Cool! > > 2017-07-13 9:43 GMT+09:00 Denny Lee : > >> This is amazingly awesome! :) >> >> On Wed, Jul 12, 2017 at 13:23 lucas.g...@gmail.com >> wrote: >> >>> That's great!

Re: scala test is unable to initialize spark context.

2017-04-06 Thread Jeff Zhang
Seems it is caused by your log4j file: *Caused by: java.lang.IllegalStateException: FileNamePattern [-.log] does not contain a valid date format specifier* wrote on Thu, Apr 6, 2017 at 4:03 PM: > Hi All, > > I am just trying to use scala test for testing a small spark code .

Re: bug with PYTHONHASHSEED

2017-04-04 Thread Jeff Zhang
It is fixed in https://issues.apache.org/jira/browse/SPARK-13330 Holden Karau wrote on Wed, Apr 5, 2017 at 12:03 AM: > Which version of Spark is this (or is it a dev build)? We've recently made > some improvements with PYTHONHASHSEED propagation. > > On Tue, Apr 4, 2017 at 7:49 AM Eike
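
For builds that predate that fix, a commonly used workaround is to pin the hash seed yourself; a hedged sketch (the value 0 is arbitrary, and the driver process should be started with the same PYTHONHASHSEED):

    from pyspark.sql import SparkSession

    # Pin PYTHONHASHSEED on the executors so Python 3 string hashing is
    # consistent across workers; older Spark versions did not propagate it.
    spark = (SparkSession.builder
             .config("spark.executorEnv.PYTHONHASHSEED", "0")
             .getOrCreate())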

Re: 答复: submit spark task on yarn asynchronously via java?

2016-12-25 Thread Jeff Zhang
Or you can use livy to submit spark jobs: http://livy.io/ Linyuxin wrote on Mon, Dec 26, 2016 at 10:32 AM: > Thanks. > > *From:* Naveen [mailto:hadoopst...@gmail.com] > *Sent:* Dec 25, 2016 0:33 > *To:* Linyuxin > *Cc:* user > *Subject:* Re:
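
A small sketch of what asynchronous submission through Livy's REST batch API can look like (host, port, and file path are assumptions; requires the requests package):

    import requests

    livy_url = "http://livy-host:8998"             # hypothetical Livy endpoint
    payload = {"file": "hdfs:///jobs/my_job.py",   # application already on HDFS
               "args": ["2016-12-25"]}

    # POST /batches returns immediately with a batch id instead of blocking
    # the way spark-submit does.
    resp = requests.post(livy_url + "/batches", json=payload)
    batch_id = resp.json()["id"]

    # Poll the batch state later (e.g. "starting", "running", "success").
    state = requests.get("{}/batches/{}".format(livy_url, batch_id)).json()["state"]
    print(batch_id, state)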

Re: HiveContext is Serialized?

2016-10-25 Thread Jeff Zhang
In your sample code, you can use hiveContext in the foreach because it is a scala List foreach operation, which runs on the driver side. But you cannot use hiveContext in RDD.foreach. Ajay Chander wrote on Wed, Oct 26, 2016 at 11:28 AM: > Hi Everyone, > > I was thinking if I can use hiveContext
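
A small sketch of the distinction (table names are illustrative; assumes a Hive-enabled Spark 1.x build, where HiveContext is available):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="hivecontext-scope-sketch")
    hiveContext = HiveContext(sc)
    tables = ["db.table_a", "db.table_b"]

    # Driver-side loop: fine. A plain collection foreach runs in the driver,
    # so each query goes through hiveContext normally.
    for t in tables:
        hiveContext.sql("SELECT COUNT(*) FROM " + t).show()

    # Executor-side closure: not fine. The lambda below would run on executors,
    # and hiveContext cannot be serialized and shipped to them.
    # sc.parallelize(tables).foreach(lambda t: hiveContext.sql("SELECT COUNT(*) FROM " + t))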

Re: Using Zeppelin with Spark FP

2016-09-11 Thread Jeff Zhang

Re: Spark 2.0.0 Thrift Server problem with Hive metastore

2016-09-05 Thread Jeff Zhang
ngConstructorAccessorImpl.java:45) > > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > > at org.apache.hive.service.cli.HiveSQLException.newInstance( > HiveSQLException.java:244) > > at org.apache.hive.service.cli.HiveSQLException.toStackTrace( > HiveSQLException.java:210) > > ... 15 more > > Error: Error retrieving next row (state=,code=0) > > > > The same command works when using Spark 1.6, is it a possible issue? > > > > Thanks! > -- Best Regards Jeff Zhang

Re: spark run shell On yarn

2016-07-28 Thread Jeff Zhang
-bin-hadoop2.6/bin/spark-submit > export YARN_CONF_DIR=/etc/hadoop/conf > export HADOOP_CONF_DIR=/etc/hadoop/conf > export SPARK_HOME=/etc/spark-2.0.0-bin-hadoop2.6 > > > how I to update? > > > > > > === > Name: cen sujun > Mobile: 13067874572 > Mail: ce...@lotuseed.com > > -- Best Regards Jeff Zhang

Re: spark local dir to HDFS ?

2016-07-05 Thread Jeff Zhang

Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Jeff Zhang
the feeling, > that I'm trying some very rare case? > > 2016-07-01 10:54 GMT-07:00 Jeff Zhang <zjf...@gmail.com>: > >> This is not a bug, because these 2 processes use the same SPARK_PID_DIR >> which is /tmp by default. Although you can resolve this by using >&

Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Jeff Zhang
v.e...@gmail.com> wrote: > I get > > "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as > process 28989. Stop it first." > > Is it a bug? > > 2016-07-01 10:10 GMT-07:00 Jeff Zhang <zjf...@gmail.com>: > >> I don't think the one i

Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Jeff Zhang
same process as a > server, so it makes some sense, but it's really inconvenient - I need a lot > of memory on my driver machine. Reasons for one instance per machine I do > not understand. > > -- > > > *Sincerely yoursEgor Pakhomov* > -- Best Regards Jeff Zhang

Re: Remote RPC client disassociated

2016-06-30 Thread Jeff Zhang
gt; at > scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:968) > > at > scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972) > > at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > > at > scala.collection.Iterator$class.foreach(Iterator.scala:727) > > at > scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > > at > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452) > > at > org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280) > > at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765) > >at > org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239) > BR > > > > Joaquin > This email is confidential and may be subject to privilege. If you are not > the intended recipient, please do not copy or disclose its content but > contact the sender immediately upon receipt. > -- Best Regards Jeff Zhang

Re: Call Scala API from PySpark

2016-06-30 Thread Jeff Zhang
'writeUTF'] > > The next thing I would run into is converting the JVM RDD[String] back to > a Python RDD, what is the easiest way to do this? > > Overall, is this a good approach to calling the same API in Scala and > Python? > > -- > Pedro Rodriguez > PhD Student in Distributed Machine Learning | CU Boulder > UC Berkeley AMPLab Alumni > > ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423 > Github: github.com/EntilZha | LinkedIn: > https://www.linkedin.com/in/pedrorodriguezscience > > -- Best Regards Jeff Zhang

Re: Error Invoking Spark on Yarn on using Spark Submit

2016-06-24 Thread Jeff Zhang
JettyServer(JettyUtils.scala:262) >> at org.apache.spark.ui.WebUI.bind(WebUI.scala:137) >> at >> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481) >> at >> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481) >> at scala.Option.foreach(Option.scala:236) >> at org.apache.spark.SparkContext.(SparkContext.scala:481) >> at >> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59) >> >> >> > -- Best Regards Jeff Zhang

Re: Building Spark 2.X in Intellij

2016-06-23 Thread Jeff Zhang
ntBatch > def isErrorBatch(batch: EventBatch): Boolean = { > ^ > > /git/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkSink.scala > Error:(86, 51) not found: type SparkFlumeProtocol > val responder = new

Re: Building Spark 2.X in Intellij

2016-06-22 Thread Jeff Zhang
us library references that - although included in the > pom.xml build - are for some reason not found when processed within > Intellij. > -- Best Regards Jeff Zhang

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread Jeff Zhang
examples.SparkPi >> --master yarn-client --driver-memory 512m --num-executors 2 >> --executor-memory 512m --executor-cores 210: >> >> >> >>- Error: Could not find or load main class >>org.apache.spark.deploy.yarn.ExecutorLauncher >> >> but i don't config that para ,there no error why???that para is only >> avoid Uploading resource file(jar package)?? >> > > -- Best Regards Jeff Zhang

Re: Does saveAsHadoopFile depend on master?

2016-06-21 Thread Jeff Zhang
thoughts on how to track down > what is happening here? > > Thanks! > > Pierre. > -- Best Regards Jeff Zhang

Re: Build Spark 2.0 succeeded but could not run it on YARN

2016-06-20 Thread Jeff Zhang

Re: How to deal with tasks running too long?

2016-06-16 Thread Jeff Zhang
e last 50 > tasks in my example can be killed (timed out) and the stage completes > successfully. > > -- > Thanks, > -Utkarsh > -- Best Regards Jeff Zhang

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
. Won't I > be able to do this processing in local mode then? > > Regards, > Tejaswini > > On Wed, Jun 15, 2016 at 6:32 PM, Jeff Zhang <zjf...@gmail.com> wrote: > >> You are using local mode, --executor-memory won't take effect for local >> mode, please use other

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
You are using local mode; --executor-memory won't take effect in local mode. Please use another cluster mode. On Thu, Jun 16, 2016 at 9:32 AM, Jeff Zhang <zjf...@gmail.com> wrote: > Specify --executor-memory in your spark-submit command. > > On Thu, Jun 16, 2016 at 9:
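
A hedged sketch of the implication: in local mode the executors live inside the driver JVM, so it is the driver memory that has to grow (the 8g value is only an example; setting it in the builder works when the JVM is launched from the Python process itself, otherwise pass --driver-memory to spark-submit):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[4]")
             .config("spark.driver.memory", "8g")  # example value, not a recommendation
             .getOrCreate())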

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
Specify --executor-memory in your spark-submit command. On Thu, Jun 16, 2016 at 9:01 AM, spR <data.smar...@gmail.com> wrote: > Thank you. Can you pls tell How to increase the executor memory? > > > > On Wed, Jun 15, 2016 at 5:59 PM, Jeff Zhang <zjf...@gmail.com

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
k.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.ap

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
eption. But I have 16 gb of ram > > Then my notebook dies and I get below error > > Py4JNetworkError: An error occurred while trying to connect to the Java server > > > Thank You > -- Best Regards Jeff Zhang

Re: Limit pyspark.daemon threads

2016-06-15 Thread Jeff Zhang
>> Though I didn't find answer for your first question, I think the following pertains to your second question:
>>
>> spark.python.worker.memory (default 512m): Amount of memory to use per python worker process during aggregation, in the same format as JVM memory strings (e.g. 512m, 2g). If the memory used during aggregation goes above this amount, it will spill the data into disks.
>>
>> On Thu, Mar 17, 2016 at 7:43 AM, Carlile, Ken <carli...@janelia.hhmi.org> wrote:
>>> Hello,
>>>
>>> We have an HPC cluster that we run Spark jobs on using standalone mode and a number of scripts I’ve built up to dynamically schedule and start spark clusters within the Grid Engine framework. Nodes in the cluster have 16 cores and 128GB of RAM.
>>>
>>> My users use pyspark heavily. We’ve been having a number of problems with nodes going offline with extraordinarily high load. I was able to look at one of those nodes today before it went truly sideways, and I discovered that the user was running 50 pyspark.daemon threads (remember, this is a 16 core box), and the load was somewhere around 25 or so, with all CPUs maxed out at 100%.
>>>
>>> So while the spark worker is aware it’s only got 16 cores and behaves accordingly, pyspark seems to be happy to overrun everything like crazy. Is there a global parameter I can use to limit pyspark threads to a sane number, say 15 or 16? It would also be interesting to set a memory limit, which leads to another question.
>>>
>>> How is memory managed when pyspark is used? I have the spark worker memory set to 90GB, and there is 8GB of system overhead (GPFS caching), so if pyspark operates outside of the JVM memory pool, that leaves it at most 30GB to play with, assuming there is no overhead outside the JVM’s 90GB heap (ha ha.)
>>>
>>> Thanks,
>>> Ken Carlile
>>> Sr. Unix Engineer
>>> HHMI/Janelia Research Campus
>>> 571-209-4363
-- Best Regards Jeff Zhang
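
A hedged sketch of the settings touched on above (values are examples only, not recommendations):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            # Cap per-python-worker aggregation memory before spilling to disk.
            .set("spark.python.worker.memory", "512m")
            # In standalone mode, bound how many cores (and hence concurrent
            # tasks and their python workers) one application can take.
            .set("spark.cores.max", "16"))
    sc = SparkContext(conf=conf)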

Re: sqlcontext - not able to connect to database

2016-06-14 Thread Jeff Zhang
gt; at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:209) > at java.lang.Thread.run(Thread.java:745) > > > -- Best Regards Jeff Zhang

Re: Spark corrupts text lines

2016-06-14 Thread Jeff Zhang

Re: java.io.FileNotFoundException

2016-06-03 Thread Jeff Zhang
/application_1463194314221_211370/spark-3cc37dc7-fa3c-4b98-aa60-0acdfc79c725/28/shuffle_8553_38_0.index >> (No such file or directory) >> >> any idea about this error ? >> -- >> Thanks, >> Kishore. >> > > > > -- > Thanks, > Kishore. > -- Best Regards Jeff Zhang

Bug of PolynomialExpansion ?

2016-05-29 Thread Jeff Zhang
, x2*x3, x3*x1, x3*x2,x3*x3) (3,[0,2],[1.0,1.0]) --> (9,[0,1,5,6,8],[1.0,1.0,1.0,1.0,1.0])| -- Best Regards Jeff Zhang
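
A small reproduction sketch of the expansion being discussed (column names are illustrative; assumes a Spark version where the DataFrame-based pyspark.ml API and pyspark.ml.linalg are available):

    from pyspark.ml.feature import PolynomialExpansion
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(Vectors.sparse(3, [0, 2], [1.0, 1.0]),)], ["features"])

    # degree=2 expands (x1, x2, x3) into all degree-1 and degree-2 terms.
    px = PolynomialExpansion(degree=2, inputCol="features", outputCol="expanded")
    px.transform(df).show(truncate=False)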

Re: run multiple spark jobs yarn-client mode

2016-05-25 Thread Jeff Zhang
p/spark-945fa8f4-477c-4a65-a572-b247e9249061/userFiles-857fece4-83c4-441a-8d3e-2a6ae8e3193a > INFO apache.spark.util.ShutdownHookManager - Deleting directory > /tmp/spark-945fa8f4-477c-4a65-a572-b247e9249061 > > > > Sent from Yahoo Mail. Get the app <https://yho.com/148vdq> > -- Best Regards Jeff Zhang

Any way to pass custom hadoop conf to through spark thrift server ?

2016-05-19 Thread Jeff Zhang
I want to pass a custom hadoop conf to the spark thrift server so that both the driver and the executor side can get this conf. I also want this custom hadoop conf to be visible only to the job of the user who set it. Is this possible with the spark thrift server now? Thanks -- Best Regards Jeff Zhang

Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Jeff Zhang

Re: Why does spark 1.6.0 can't use jar files stored on HDFS

2016-05-17 Thread Jeff Zhang
> > and then spark application main class fails with class not found exception. > Is there any workaround? > -- Best Regards Jeff Zhang

Re: Spark crashes with Filesystem recovery

2016-05-17 Thread Jeff Zhang
t > py4j.protocol.Py4JNetworkError: An error occurred while trying to connect > to the Java server > > Even though I start pyspark with these options: > ./pyspark --master local[4] --executor-memory 14g --driver-memory 14g > --packages com.databricks:spark-csv_2.11:1.4.0 > --spark.deploy.recoveryMode=FILESYSTEM > > and this in my /conf/spark-env.sh file: > - SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM > -Dspark.deploy.recoveryDirectory=/user/recovery" > > How can I get HA to work in Spark? > > thanks, > imran > > -- Best Regards Jeff Zhang

Re: pandas dataframe broadcasted. giving errors in datanode function called kernel

2016-05-17 Thread Jeff Zhang

Re: Spark 1.6.1 throws error: Did not find registered driver with class oracle.jdbc.OracleDriver

2016-04-27 Thread Jeff Zhang
ot;="268435456", > "orc.row.index.stride"="1" ) > """ > HiveContext.sql(sqltext) > // > sqltext = """ > INSERT INTO TABLE test.dummy2 > SELECT > * > FROM tmp > """ > HiveContext.sql(sqltext) > > In Spark 1.6.1, it is throwing error as below > > > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 1.0 (TID 4, rhes564): java.lang.IllegalStateException: Did not find > registered driver with class oracle.jdbc.OracleDriver > > Is this a new bug introduced in Spark 1.6.1? > > > Thanks > -- Best Regards Jeff Zhang

Re: executor delay in Spark

2016-04-24 Thread Jeff Zhang
> >>>>> >> If the data file is same then it should have similar distribution of >>>>> >> keys. >>>>> >> Few queries- >>>>> >> >>>>> >> 1. Did you compare the number of partitions in both the cases? >>>>> >> 2. Did you compare the resource allocation for Spark Shell vs Scala >>>>> >> Program being submitted? >>>>> >> >>>>> >> Also, can you please share the details of Spark Context, >>>>> Environment and >>>>> >> Executors when you run via Scala program? >>>>> >> >>>>> >> On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju < >>>>> >> m.vijayaragh...@gmail.com> wrote: >>>>> >> >>>>> >>> Hello All, >>>>> >>> >>>>> >>> We are using HashPartitioner in the following way on a 3 node >>>>> cluster (1 >>>>> >>> master and 2 worker nodes). >>>>> >>> >>>>> >>> val u = >>>>> >>> sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt").map[(Int, >>>>> >>> Int)](line => { line.split("\\|") match { case Array(x, y) => >>>>> (y.toInt, >>>>> >>> x.toInt) } }).partitionBy(new >>>>> HashPartitioner(8)).setName("u").persist() >>>>> >>> >>>>> >>> u.count() >>>>> >>> >>>>> >>> If we run this from the spark shell, the data (52 MB) is split >>>>> across >>>>> >>> the >>>>> >>> two worker nodes. But if we put this in a scala program and run >>>>> it, then >>>>> >>> all the data goes to only one node. We have run it multiple times, >>>>> but >>>>> >>> this >>>>> >>> behavior does not change. This seems strange. >>>>> >>> >>>>> >>> Is there some problem with the way we use HashPartitioner? >>>>> >>> >>>>> >>> Thanks in advance. >>>>> >>> >>>>> >>> Regards, >>>>> >>> Raghava. >>>>> >>> >>>>> >> >>>>> >> >>>>> > >>>>> > >>>>> > -- >>>>> > Regards, >>>>> > Raghava >>>>> > http://raghavam.github.io >>>>> > >>>>> >>>>> >>>>> -- >>>>> Thanks, >>>>> Mike >>>>> >>>> >>>> >>>> >>>> -- >>>> Regards, >>>> Raghava >>>> http://raghavam.github.io >>>> >>> >> >> >> -- >> Regards, >> Raghava >> http://raghavam.github.io >> > -- Best Regards Jeff Zhang

Re: Re: Re: Why Spark having OutOfMemory Exception?

2016-04-20 Thread Jeff Zhang
d, Apr 20, 2016 at 3:55 PM, 李明伟 <kramer2...@126.com> wrote: > Hi Jeff > > The total size of my data is less than 10M. I already set the driver > memory to 4GB. > > On 2016-04-20 13:42:25, "Jeff Zhang" <zjf...@gmail.com> wrote: >

Re: Re: Why Spark having OutOfMemory Exception?

2016-04-19 Thread Jeff Zhang

Re: Spark 1.6.0 - token renew failure

2016-04-13 Thread Jeff Zhang
> hadoop.proxyuser.spark.groups = * > hadoop.proxyuser.spark.hosts = * > ... > hadoop.security.auth_to_local = RULE:[1:$1@$0](spark-pantagr...@contactlab.lan)s/.*/spark/ DEFAULT > > "spark" is present as a local user on all servers. > > What is missing here? -- Best Regards Jeff Zhang

Re: sqlContext.cacheTable + yarn client mode

2016-03-30 Thread Jeff Zhang
ing an OOM > issue on the local Spark driver for some SQL code and was wondering if the > local cache load could be the culprit. > > Appreciate any thoughts. BTW, we're running Spark 1.6.0 on this particular > cluster. > > Regards, > > Soam > -- Best Regards Jeff Zhang

Re: pyspark unable to convert dataframe column to a vector: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2016-03-29 Thread Jeff Zhang
tance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:249) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:327) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:226) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:229) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > > > -- Best Regards Jeff Zhang

Re: run spark job

2016-03-29 Thread Jeff Zhang

Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
Zhang <zjf...@gmail.com> wrote: > I think I got the root cause, you can use Text.toString() to solve this > issue. Because the Text is shared so the last record display multiple > times. > > On Wed, Mar 23, 2016 at 11:37 AM, Jeff Zhang <zjf...@gmail.com> wrote: > &g

Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
I think I got the root cause: you can use Text.toString() to solve this issue. Because the Text object is reused, the last record displays multiple times. On Wed, Mar 23, 2016 at 11:37 AM, Jeff Zhang <zjf...@gmail.com> wrote: > Looks like a spark bug. I can reproduce it for sequ

Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
rm that it is a bug. > > Thanks in advance for your help! > > -- > Thamme > -- Best Regards Jeff Zhang

Re: DataFrame vs RDD

2016-03-22 Thread Jeff Zhang

Re: exception while running job as pyspark

2016-03-16 Thread Jeff Zhang
. 36 more > > }}} > > python2.7 couldn't be found. But I am using a virtualenv python 2.7 > {{{ > [ram@test-work workspace]$ python > Python 2.7.8 (default, Mar 15 2016, 04:37:00) > [GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> > }}} > > Can anyone help me with this? > Thanks > -- Best Regards Jeff Zhang

Re: Job failed while submitting python to yarn programatically

2016-03-15 Thread Jeff Zhang

Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
It's the same as the hive thrift server; I believe kerberos is supported. On Wed, Mar 16, 2016 at 10:48 AM, ayan guha <guha.a...@gmail.com> wrote: > so, how about implementing security? Any pointer will be helpful > > On Wed, Mar 16, 2016 at 1:44 PM, Jeff Zhang <zjf..

Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
I view spark thriftserver as a better version > of hive one (with Spark as execution engine instead of MR/Tez) OR should I > see it as a JDBC server? > > On Wed, Mar 16, 2016 at 11:44 AM, Jeff Zhang <zjf...@gmail.com> wrote: > >> spark thrift server is very similar with hive

Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
DBC/Thrift supports security? Can we restrict certain > users to access certain dataframes and not the others? > > -- > Best Regards, > Ayan Guha > -- Best Regards Jeff Zhang

Re: what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Jeff Zhang
wrote: > Thanks Jeff > > I was looking for something like ‘unregister’ > > > In SQL you use drop to delete a table. I always though register was a > strange function name. > > Register **-1 = unregister > createTable **-1 == dropTable > > Andy > > From: Jeff Zh

Re: what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Jeff Zhang
>>> sqlContext.registerDataFrameAsTable(df, "table1") >>> sqlContext.dropTempTable("table1") On Wed, Mar 16, 2016 at 7:40 AM, Andy Davidson < a...@santacruzintegration.com> wrote: > Thanks > > Andy > -- Best Regards Jeff Zhang

Re: Saving multiple outputs in the same job

2016-03-09 Thread Jeff Zhang
, which is to > > have some control over which saves go into which jobs, and then execute > the > > jobs directly. I can envision a new version of the various save functions > > which take an extra job argument, or something, or some way to defer and > > unblock job creation in the spark context. > > > > Ideas? > > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: OOM exception during Broadcast

2016-03-07 Thread Jeff Zhang
eResourceAllocation is set to true (executor.memory = 48G > according to spark ui environment). We're also using kryo serialization and > Yarn is the resource manager. > > Any ideas as what might be going wrong and how to debug this? > > Thanks, > Arash > > -- Best Regards Jeff Zhang

Re: Renaming sc variable in sparkcontext throws task not serializable

2016-03-02 Thread Jeff Zhang
ize(0 to 100).map(p => p + temp). > > I am not sure if this is a known issue, or we should file a JIRA for it. > We originally came across this bug in the SciSpark project. > > Best, > > Rahul P > -- Best Regards Jeff Zhang

Re: Spark on Yarn with Dynamic Resource Allocation. Container always marked as failed

2016-03-02 Thread Jeff Zhang
marked as failed: > container_1456905762620_0002_01_02 on host: bold-x.rice.edu. Exit status: > 1. Diagnostics: Exception from container-launch. > > > Is there anybody know what is the problem here? > Best, > Xiaoye > -- Best Regards Jeff Zhang

Re: Spark executor killed without apparent reason

2016-03-01 Thread Jeff Zhang
D PROCESS_LOCAL 15 / maprnode5 2016/02/24 11:08:55 / >>> ExecutorLostFailure >>> (executor 15 lost) >>> >>> here we can see executor id is 5 but executor logs itself doesn't use >>> this id as reference in log stream so it's hard to cross check logs. >>> >>> Anyhow my main issue is to determine cause of executor killing. >>> >>> Thanks >>> >>> Nirav -- Best Regards Jeff Zhang

Re: Is spark.driver.maxResultSize used correctly ?

2016-03-01 Thread Jeff Zhang
the driver side? > > > On Sunday, February 28, 2016, Jeff Zhang <zjf...@gmail.com> wrote: > >> data skew might be possible, but not the common case. I think we should >> design for the common case, for the skew case, we may can set some >> parameter of fraction to allow

Re: Converting array to DF

2016-03-01 Thread Jeff Zhang
toDF is not a member of Array[(String, Int)] > weights.toDF("weights","value") > > I want to label columns and print out the contents in value order please I > don't know why I am getting this error > > Thanks > > -- Best Regards Jeff Zhang

Re: Support virtualenv in PySpark

2016-03-01 Thread Jeff Zhang
you can already do what you proposed by creating > identical virtualenvs on all nodes on the same path and change the spark > python path to point to the virtualenv. > > Best Regards, > Mohannad > On Mar 1, 2016 06:07, "Jeff Zhang" <zjf...@gmail.com> wrote: >

Re: Save DataFrame to Hive Table

2016-02-29 Thread Jeff Zhang
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Support virtualenv in PySpark

2016-02-29 Thread Jeff Zhang
.virtualenv.path (path to the executable for virtualenv/conda) Best Regards Jeff Zhang

Re: Is spark.driver.maxResultSize used correctly ?

2016-02-28 Thread Jeff Zhang
have skew and almost all the result data are in > one or a few tasks though. > > > On Friday, February 26, 2016, Jeff Zhang <zjf...@gmail.com> wrote: > >> >> My job get this exception very easily even when I set large value of >> spark.driver.maxRe

Re: PySpark : couldn't pickle object of type class T

2016-02-28 Thread Jeff Zhang
the > union of multiple data-types. > > Thanks, > AnoopShiralige > > > On Thu, Feb 25, 2016 at 7:25 AM, Jeff Zhang <zjf...@gmail.com> wrote: > >> Avro Record is not supported by pickler, you need to create a custom >> pickler for it. But I don't think it

Re: Dynamic allocation Spark

2016-02-26 Thread Jeff Zhang

Re: No event log in /tmp/spark-events

2016-02-26 Thread Jeff Zhang

Is spark.driver.maxResultSize used correctly ?

2016-02-26 Thread Jeff Zhang
(1085.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB) -- Best Regards Jeff Zhang
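
Independently of whether the check itself is correct, the usual way around the error is to raise the limit (or set it to 0 for unlimited); a sketch, with 2g as an arbitrary example value:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.driver.maxResultSize", "2g")
             .getOrCreate())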

Re: When I merge some datas,can't go on...

2016-02-26 Thread Jeff Zhang
ext: > http://apache-spark-user-list.1001560.n3.nabble.com/When-I-merge-some-datas-can-t-go-on-tp26341.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: us

Re: PySpark : couldn't pickle object of type class T

2016-02-24 Thread Jeff Zhang

Re: which master option to view current running job in Spark UI

2016-02-23 Thread Jeff Zhang
th this option I cant see the currently running jobs in Spark WEB UI >> though it later appear in spark history server. >> >> My question with which --master option should I run my spark jobs so that >> I can view the currently running jobs in spark web UI . >> >> Thanks, >> Divya >> > -- Best Regards Jeff Zhang

Re: Joining three tables with data frames

2016-02-13 Thread Jeff Zhang
be > defined in data frame for each table rather than importing the whole > columns. > > Thanks, > > Mich Talebzadeh > -- Best Regards Jeff Zhang

Re: corresponding sql for query against LocalRelation

2016-01-27 Thread Jeff Zhang

Re: How to discretize Continuous Variable with Spark DataFrames

2016-01-25 Thread Jeff Zhang
ises >>> >>> _http://www.inside-r.org/packages/cran/arules/docs/discretize >>> >>> R code for example : >>> >>> ### equal frequency >>> table(discretize(data$some_column, "frequency", categories=10)) >>> >>> >>> #k-means >>> table(discretize(data$some_column, "cluster", categories=10)) >>> >>> Thanks a lot ! >>> >> >> >> >> -- >> Joshua Taylor, http://www.cs.rpi.edu/~tayloj/ >> > > -- Best Regards Jeff Zhang
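
One way to get equal-frequency bins with the DataFrame API is QuantileDiscretizer; a sketch, not necessarily the approach suggested in this thread (column names and the toy data are illustrative, and a Spark version with the pyspark.ml wrapper is assumed):

    from pyspark.ml.feature import QuantileDiscretizer
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(float(i),) for i in range(100)], ["some_column"])

    # 10 buckets of (approximately) equal frequency, like discretize(..., "frequency").
    qd = QuantileDiscretizer(numBuckets=10, inputCol="some_column",
                             outputCol="some_column_bin")
    qd.fit(df).transform(df).groupBy("some_column_bin").count().show()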

Re: SparkR with Hive integration

2016-01-18 Thread Jeff Zhang
LTask: org.apache.hadoop.hive.ql.metadata.HiveException: > MetaException(message:file:/user/hive/warehouse/src is not a directory or > unable to create one) > > How to use HDFS instead of the local file system (file)? > Which parameter should I set? > > Thanks a lot. > > > Peter Zhang > -- > Google > Sent with Airmail > -- Best Regards Jeff Zhang

Re: yarn-client: SparkSubmitDriverBootstrapper not found in yarn client mode (1.6.0)

2016-01-13 Thread Jeff Zhang
rk.deploy.SparkSubmitDriverBootstrapper >> >> If I replace deploy-mode to cluster the job is submitted successfully. >> Is there a dependency missing from my project? Right now only one I >> included is spark-streaming 1.6.0. >> > > -- Best Regards Jeff Zhang
