Re: Choice of IDE for Spark

2021-09-30 Thread Jeff Zhang
roperty which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > -- Best Regards Jeff Zhang

Re: Unable to use scala function in pyspark

2021-09-26 Thread Jeff Zhang
You can first try it via docker: http://zeppelin.apache.org/download.html#using-the-official-docker-image Jeff Zhang wrote on Mon, Sep 27, 2021 at 6:49 AM: > Hi kumar, > > You can try Zeppelin, which supports sharing UDFs across languages > > http://zeppelin.apache.org/ > > > > >

Re: Unable to use scala function in pyspark

2021-09-26 Thread Jeff Zhang
1268 return args_command, temp_args > > ~/.sdkman/candidates/spark/3.0.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py > in (.0) >1264 > 1265 args_command = "".join( > -> 1266 [get_command_part(arg, self.pool) for arg in new_args]) >1267 >1268 return args_command, temp_args > > ~/.sdkman/candidates/spark/3.0.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py > in get_command_part(parameter, python_proxy_pool) > 296 command_part += ";" + interface > 297 else: > --> 298 command_part = REFERENCE_TYPE + parameter._get_object_id() > 299 > 300 command_part += "\n" > > > > > > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards Jeff Zhang

[ANNOUNCE] Apache Zeppelin 0.10.0 is released, Spark on Zeppelin Improved

2021-08-26 Thread Jeff Zhang
/interpreter/spark.html Download it here: https://zeppelin.apache.org/download.html -- Best Regards Jeff Zhang Twitter: zjffdu

Re: Databricks notebook - cluster taking a long time to get created, often timing out

2021-08-16 Thread Jeff Zhang
to get created, often timing out. > > Any ideas on how to resolve this ? > Any other alternatives to databricks notebook ? > > -- Best Regards Jeff Zhang

Is the pandas version in doc of using pyarrow in spark wrong

2021-08-09 Thread Jeff Zhang
/api/python/user_guide/arrow_pandas.html#recommended-pandas-and-pyarrow-versions -- Best Regards Jeff Zhang

Re: Fail to use SparkR of 3.0 preview 2

2019-12-26 Thread Jeff Zhang
c 2018) > > -- > *From:* Jeff Zhang > *Sent:* Thursday, December 26, 2019 5:36:50 PM > *To:* Felix Cheung > *Cc:* user.spark > *Subject:* Re: Fail to use SparkR of 3.0 preview 2 > > I use R 3.5.2 > > Felix Cheung wrote on Fri, Dec 27, 2019 at 4:32 AM: > > It l

Re: Fail to use SparkR of 3.0 preview 2

2019-12-26 Thread Jeff Zhang
I use R 3.5.2 Felix Cheung wrote on Fri, Dec 27, 2019 at 4:32 AM: > It looks like a change in the method signature in the R base packages. > > Which version of R are you running on? > > -- > *From:* Jeff Zhang > *Sent:* Thursday, December 26, 2019 12:46:12

Fail to use SparkR of 3.0 preview 2

2019-12-26 Thread Jeff Zhang
")): number of columns of matrices must match (see arg 2) During startup - Warning messages: 1: package ‘SparkR’ was built under R version 3.6.2 2: package ‘SparkR’ in options("defaultPackages") was not found Does anyone know what might be wrong ? Thanks -- Best Regards Jeff Zhang

Re: Spark job fails because of timeout to Driver

2019-10-04 Thread Jeff Zhang
d there's no bug, but there must be > something wrong with my setup. I don't understand the code of the > ApplicationMaster, so could somebody explain me what it is trying to reach? > Where exactly does the connection timeout? So at least I can debug it > further because I don't have a clue what it is doing :-) > > Thanks for any help! > Jochen > -- Best Regards Jeff Zhang

Re: [spark on yarn] spark on yarn without DFS

2019-05-19 Thread Jeff Zhang
rn cluster mode. Could I using yarn > without start DFS, how could I use this mode? > > Yours, > Jane > -- Best Regards Jeff Zhang

Re: Best notebook for developing for apache spark using scala on Amazon EMR Cluster

2019-05-01 Thread Jeff Zhang
://apache-spark-user-list.1001560.n3.nabble.com/ > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: [ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-15 Thread Jeff Zhang
bers for contributing to > this release. This release would not have been possible without you. > > Bests, > Dongjoon. > -- Best Regards Jeff Zhang

Re: Run/install tensorframes on zeppelin pyspark

2018-08-08 Thread Jeff Zhang
Make sure you use the correct python which has tensorframes installed. Use PYSPARK_PYTHON to configure the python. Spico Florin wrote on Wed, Aug 8, 2018 at 9:59 PM: > Hi! > > I would like to use tensorframes in my pyspark notebook. > > I have performed the following: > > 1. In the spark interpreter added a new r
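For illustration, a minimal sketch (outside Zeppelin, in a plain PySpark session) of pointing PYSPARK_PYTHON at an interpreter that has the needed packages; the path below is a placeholder, not taken from the thread. In Zeppelin itself this is normally set in the spark interpreter settings rather than in code.

    import os
    import sys

    # Placeholder path to a python that has tensorframes installed.
    os.environ["PYSPARK_PYTHON"] = "/opt/envs/tf/bin/python"

    from pyspark import SparkContext
    sc = SparkContext(master="local[2]", appName="pyspark-python-demo")

    # The python workers are launched with the interpreter chosen above, so
    # packages installed there are importable on the executor side as well.
    print(sc.parallelize(range(4)).map(lambda _: sys.executable).collect())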

Re: Spark YARN Error - triggering spark-shell

2018-06-08 Thread Jeff Zhang
Check the yarn AM log for details. Aakash Basu wrote on Fri, Jun 8, 2018 at 4:36 PM: > Hi, > > Getting this error when trying to run Spark Shell using YARN - > > Command: *spark-shell --master yarn --deploy-mode client* > > 2018-06-08 13:39:09 WARN Client:66 - Neither spark.yarn.jars nor > spark.yarn.archiv

Re: Livy Failed error on Yarn with Spark

2018-05-24 Thread Jeff Zhang
Could you check the spark app's yarn log and the livy log? Chetan Khatri wrote on Thu, May 10, 2018 at 4:18 AM: > All, > > I am running on Hortonworks HDP Hadoop with Livy and Spark 2.2.0; when I > run the same spark job using spark-submit it succeeds with all > transformations done. > > When

Re: [Spark] Supporting python 3.5?

2018-05-24 Thread Jeff Zhang
It supports python 3.5, and IIRC spark also supports python 3.6. Irving Duran wrote on Thu, May 10, 2018 at 9:08 PM: > Does spark now support python 3.5 or is it just 3.4.x? > > https://spark.apache.org/docs/latest/rdd-programming-guide.html > > Thank You, > > Irving Duran >

Re: Spark on YARN in client-mode: do we need 1 vCore for the AM?

2018-05-24 Thread Jeff Zhang
I don't think it is possible to have less than 1 core for the AM; this is due to yarn, not spark. The AM's share of resources compared to the executors should be small and acceptable. If you do want to save more resources, I would suggest using yarn cluster mode, where the driver and the AM run in the same

Re: spark-submit can find python?

2018-01-15 Thread Jeff Zhang
Hi Manuel, Looks like you are using the virtualenv support of spark. Virtualenv will create the python environment on the executors. >>> --conf >>> spark.pyspark.virtualenv.bin.path=/home/mansop/hail-test/python-2.7.2/bin/activate \ The configuration is not right, though: spark.pyspark.virtualenv.bin.path sh

Re: PIG to Spark

2018-01-08 Thread Jeff Zhang
Pig supports the spark engine now, so you can leverage spark execution with a pig script. I am afraid there's no solution to convert pig scripts to spark api code. Pralabh Kumar wrote on Mon, Jan 8, 2018 at 11:25 PM: > Hi > > Is there a convenient way /open source project to convert PIG scripts to > Spark. > > > Re

Re: pyspark configuration with Juyter

2017-11-03 Thread Jeff Zhang
You are setting PYSPARK_DRIVER_PYTHON to jupyter; please set it to a python executable instead. anudeep wrote on Fri, Nov 3, 2017 at 7:31 PM: > Hello experts, > > I installed jupyter notebook through anaconda and set the pyspark driver to use > jupyter notebook. > > I see the below issue when i try to open pyspark. > > anudeepg@datan
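A different route that sidesteps PYSPARK_DRIVER_PYTHON entirely is the findspark package: start a normal Jupyter kernel and initialise Spark from inside it. A minimal sketch, with the Spark home path as a placeholder:

    import findspark
    findspark.init("/opt/spark")   # placeholder SPARK_HOME

    from pyspark import SparkContext
    sc = SparkContext(master="local[2]", appName="jupyter-demo")
    print(sc.parallelize(range(10)).sum())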

Re: for loops in pyspark

2017-09-21 Thread Jeff Zhang
I suspect the OOM happens on the executor side; you have to check the stacktrace yourself if you cannot attach more info. Most likely it is due to your user code. Alexander Czech wrote on Thu, Sep 21, 2017 at 5:54 PM: > That is not really possible, the whole project is rather large and I would > not like to release

Re: Is there a SparkILoop for Java?

2017-09-21 Thread Jeff Zhang
You may try zeppelin, which provides a rest api to execute code: https://zeppelin.apache.org/docs/0.8.0-SNAPSHOT/usage/rest_api/notebook.html#create-a-new-note kant kodali wrote on Thu, Sep 21, 2017 at 4:09 AM: > Is there an API like SparkILoop >
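For illustration, a rough sketch of calling that notebook API from Python; the host, port, and payload shape follow the linked doc but are assumptions here:

    import requests

    ZEPPELIN = "http://localhost:8080"   # assumed Zeppelin server address

    # Create a note containing one paragraph of code ("create a new note" API).
    resp = requests.post(
        ZEPPELIN + "/api/notebook",
        json={"name": "created via REST",
              "paragraphs": [{"text": "%spark\nprintln(sc.version)"}]})
    note_id = resp.json().get("body")
    print("created note:", note_id)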

Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Jeff Zhang
Awesome! Hyukjin Kwon wrote on Thu, Jul 13, 2017 at 8:48 AM: > Cool! > > 2017-07-13 9:43 GMT+09:00 Denny Lee : > >> This is amazingly awesome! :) >> >> On Wed, Jul 12, 2017 at 13:23 lucas.g...@gmail.com >> wrote: >> >>> That's great! >>> >>> >>> >>> On 12 July 2017 at 12:41, Felix Cheung >>> wrote: >>>

Re: scala test is unable to initialize spark context.

2017-04-06 Thread Jeff Zhang
Seems it is caused by your log4j file: *Caused by: java.lang.IllegalStateException: FileNamePattern [-.log] does not contain a valid date format specifier* Sent on Thu, Apr 6, 2017 at 4:03 PM: > Hi All , > > > >I am just trying to use scala test for testing a small spark code . But > spark context is no

Re: bug with PYTHONHASHSEED

2017-04-04 Thread Jeff Zhang
It is fixed in https://issues.apache.org/jira/browse/SPARK-13330 Holden Karau wrote on Wed, Apr 5, 2017 at 12:03 AM: > Which version of Spark is this (or is it a dev build)? We've recently made > some improvements with PYTHONHASHSEED propagation. > > On Tue, Apr 4, 2017 at 7:49 AM Eike von Seggern cal.com> w
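On versions without that fix, a commonly used workaround is to pin the hash seed on the executors so python string hashing is consistent across workers; a minimal sketch (the seed value is arbitrary):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("hashseed-demo")
            .setMaster("local[2]")
            .set("spark.executorEnv.PYTHONHASHSEED", "0"))   # any fixed value works
    sc = SparkContext(conf=conf)

    # With Python 3.3+ hash randomization, anything that partitions by the hash
    # of a string (distinct, reduceByKey, join) needs the same seed everywhere.
    print(sc.parallelize(["a", "b", "a"]).distinct().collect())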

Re: Re: submit spark task on yarn asynchronously via java?

2016-12-25 Thread Jeff Zhang
Or you can use livy to submit spark jobs: http://livy.io/ Linyuxin wrote on Mon, Dec 26, 2016 at 10:32 AM: > Thanks. > > > > *From:* Naveen [mailto:hadoopst...@gmail.com] > *Sent:* Dec 25, 2016 0:33 > *To:* Linyuxin > *Cc:* user > *Subject:* Re: Re: submit spark task on yarn asynchronously via java? > > > > Hi,
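For illustration, a minimal sketch of an asynchronous submission through Livy's batch REST API; the server address and application path are placeholders:

    import requests

    LIVY = "http://livy-host:8998"   # assumed Livy server

    # POST /batches returns immediately with a batch id, so the caller is not
    # blocked the way a spark-submit subprocess would be.
    resp = requests.post(
        LIVY + "/batches",
        json={"file": "hdfs:///user/demo/my_job.py",        # placeholder app
              "conf": {"spark.executor.memory": "2g"}},
        headers={"Content-Type": "application/json"})
    batch_id = resp.json()["id"]

    # Poll the state later (e.g. starting / running / success / dead).
    state = requests.get(LIVY + "/batches/%d" % batch_id).json()["state"]
    print(batch_id, state)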

Re: HiveContext is Serialized?

2016-10-25 Thread Jeff Zhang
In your sample code, you can use hiveContext in the foreach because it is a scala List foreach operation, which runs on the driver side. But you cannot use hiveContext in RDD.foreach. Ajay Chander wrote on Wed, Oct 26, 2016 at 11:28 AM: > Hi Everyone, > > I was thinking if I can use hiveContext inside foreach like below,
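A short PySpark sketch of the same rule (table names are placeholders); the Scala case in the thread behaves the same way:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="hivecontext-scope-demo")
    hiveContext = HiveContext(sc)

    tables = ["src_a", "src_b"]   # a plain local list, iterated on the driver

    # Fine: this loop runs entirely on the driver, where hiveContext lives.
    for t in tables:
        hiveContext.sql("SELECT COUNT(*) FROM %s" % t).show()

    # Not fine: the function passed to RDD.foreach runs on executors, where the
    # context does not exist and cannot be serialized.
    # sc.parallelize(tables).foreach(lambda t: hiveContext.sql("SELECT * FROM " + t))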

Re: Using Zeppelin with Spark FP

2016-09-11 Thread Jeff Zhang
n this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > -- Best Regards Jeff Zhang

Re: Spark 2.0.0 Thrift Server problem with Hive metastore

2016-09-05 Thread Jeff Zhang
nstructorAccessorImpl.java:45) > > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > > at org.apache.hive.service.cli.HiveSQLException.newInstance( > HiveSQLException.java:244) > > at org.apache.hive.service.cli.HiveSQLException.toStackTrace( > HiveSQLException.java:210) > > ... 15 more > > Error: Error retrieving next row (state=,code=0) > > > > The same command works when using Spark 1.6, is it a possible issue? > > > > Thanks! > -- Best Regards Jeff Zhang

Re: spark run shell On yarn

2016-07-28 Thread Jeff Zhang
mit > export YARN_CONF_DIR=/etc/hadoop/conf > export HADOOP_CONF_DIR=/etc/hadoop/conf > export SPARK_HOME=/etc/spark-2.0.0-bin-hadoop2.6 > > > how I to update? > > > > > > === > Name: cen sujun > Mobile: 13067874572 > Mail: ce...@lotuseed.com > > -- Best Regards Jeff Zhang

Re: spark local dir to HDFS ?

2016-07-05 Thread Jeff Zhang
ocal-dir-to-HDFS-tp27291.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > ----- > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Jeff Zhang
'm trying some very rare case? > > 2016-07-01 10:54 GMT-07:00 Jeff Zhang : > >> This is not a bug, because these 2 processes use the same SPARK_PID_DIR >> which is /tmp by default. Although you can resolve this by using >> different SPARK_PID_DIR, I suspect you would

Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Jeff Zhang
wrote: > I get > > "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as > process 28989. Stop it first." > > Is it a bug? > > 2016-07-01 10:10 GMT-07:00 Jeff Zhang : > >> I don't think the one instance per machine is true. As long as yo

Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Jeff Zhang
; server, so it makes some sense, but it's really inconvenient - I need a lot > of memory on my driver machine. Reasons for one instance per machine I do > not understand. > > -- > > > *Sincerely yoursEgor Pakhomov* > -- Best Regards Jeff Zhang

Re: Remote RPC client disassociated

2016-06-30 Thread Jeff Zhang
ction.Iterator$$anon$11.next(Iterator.scala:328) > > at > scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:914) > > at > scala.collection.Iterator$GroupedIterator.go(Iterator.scala:929) > > at > scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:968) > > at > scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972) > > at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > > at > scala.collection.Iterator$class.foreach(Iterator.scala:727) > > at > scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > > at > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452) > > at > org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280) > > at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765) > >at > org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239) > BR > > > > Joaquin > This email is confidential and may be subject to privilege. If you are not > the intended recipient, please do not copy or disclose its content but > contact the sender immediately upon receipt. > -- Best Regards Jeff Zhang

Re: Call Scala API from PySpark

2016-06-30 Thread Jeff Zhang
Dataset', 'saveAsHadoopFile', 'saveAsNewAPIHadoopFile', > 'saveAsSequenceFile', 'sequenceFile', 'serveIterator', 'valueOfPair', > 'writeIteratorToStream', 'writeUTF'] > > The next thing I would run into is converting the JVM RDD[String] back to > a Python RDD, what is the easiest way to do this? > > Overall, is this a good approach to calling the same API in Scala and > Python? > > -- > Pedro Rodriguez > PhD Student in Distributed Machine Learning | CU Boulder > UC Berkeley AMPLab Alumni > > ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423 > Github: github.com/EntilZha | LinkedIn: > https://www.linkedin.com/in/pedrorodriguezscience > > -- Best Regards Jeff Zhang
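One pattern often used for this kind of bridging (not necessarily what was recommended later in this thread) is to have the Scala side return a DataFrame and wrap the JVM object on the Python side; the Scala class below is hypothetical:

    from pyspark.sql import DataFrame, SQLContext

    sqlContext = SQLContext(sc)   # assumes an existing SparkContext named sc

    # Call a user-provided Scala helper through py4j; com.example.MyScalaApi is
    # a placeholder for a class shipped with --jars.
    jdf = sc._jvm.com.example.MyScalaApi.buildDataFrame(sqlContext._ssql_ctx)

    # Wrap the returned JVM DataFrame so it behaves like a normal PySpark one.
    df = DataFrame(jdf, sqlContext)
    df.show()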

Re: Error Invoking Spark on Yarn on using Spark Submit

2016-06-24 Thread Jeff Zhang
:137) >> at >> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481) >> at >> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481) >> at scala.Option.foreach(Option.scala:236) >> at org.apache.spark.SparkContext.(SparkContext.scala:481) >> at >> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59) >> >> >> > -- Best Regards Jeff Zhang

Re: Building Spark 2.X in Intellij

2016-06-22 Thread Jeff Zhang
sErrorBatch(batch: EventBatch): Boolean = { > ^ > > /git/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkSink.scala > Error:(86, 51) not found: type SparkFlumeProtocol > val responder = new SpecificResponder(classOf[Spar

Re: Building Spark 2.X in Intellij

2016-06-22 Thread Jeff Zhang
that - although included in the > pom.xml build - are for some reason not found when processed within > Intellij. > -- Best Regards Jeff Zhang

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-21 Thread Jeff Zhang
ter yarn-client --driver-memory 512m --num-executors 2 >> --executor-memory 512m --executor-cores 210: >> >> >> >>- Error: Could not find or load main class >>org.apache.spark.deploy.yarn.ExecutorLauncher >> >> but i don't config that para ,there no error why???that para is only >> avoid Uploading resource file(jar package)?? >> > > -- Best Regards Jeff Zhang

Re: Does saveAsHadoopFile depend on master?

2016-06-21 Thread Jeff Zhang
gt; what is happening here? > > Thanks! > > Pierre. > -- Best Regards Jeff Zhang

Re: Build Spark 2.0 succeeded but could not run it on YARN

2016-06-20 Thread Jeff Zhang
List mailing list archive at Nabble.com. > > ----- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: How to deal with tasks running too long?

2016-06-16 Thread Jeff Zhang
asks in my example can be killed (timed out) and the stage completes > successfully. > > -- > Thanks, > -Utkarsh > -- Best Regards Jeff Zhang

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
e to do this processing in local mode then? > > Regards, > Tejaswini > > On Wed, Jun 15, 2016 at 6:32 PM, Jeff Zhang wrote: > >> You are using local mode, --executor-memory won't take effect for local >> mode, please use other cluster mode. >> >> On Th

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
You are using local mode; --executor-memory won't take effect in local mode, please use a cluster mode instead. On Thu, Jun 16, 2016 at 9:32 AM, Jeff Zhang wrote: > Specify --executor-memory in your spark-submit command. > > > > On Thu, Jun 16, 2016 at 9:01 AM, spR wrote: >
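For illustration, a minimal sketch of where the setting does apply (master and sizes are placeholders): in local mode everything runs inside the single driver JVM, so only the driver memory matters there, while executor memory is honoured once a real cluster manager is used:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("executor-memory-demo")
            .setMaster("yarn-client")                # a cluster manager, not local[*]
            .set("spark.executor.memory", "4g"))     # ignored in local mode
    sc = SparkContext(conf=conf)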

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
Specify --executor-memory in your spark-submit command. On Thu, Jun 16, 2016 at 9:01 AM, spR wrote: > Thank you. Can you pls tell How to increase the executor memory? > > > > On Wed, Jun 15, 2016 at 5:59 PM, Jeff Zhang wrote: > >> >>> Caused by: java.lang.Ou

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
nsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
of ram > > Then my notebook dies and I get below error > > Py4JNetworkError: An error occurred while trying to connect to the Java server > > > Thank You > -- Best Regards Jeff Zhang

Re: Limit pyspark.daemon threads

2016-06-15 Thread Jeff Zhang
;>>>>> >>>>>> >>>>>> spark.python.worker.memory >>>>>> 512m >>>>>> >>>>>> Amount of memory to use per python worker process during >>>>>> aggregation, in the same >>>>>> format as JVM memory strings (e.g. 512m, >>>>>> 2g). If the memory >>>>>> used during aggregation goes above this amount, it will spill the >>>>>> data into disks. >>>>>> >>>>>> >>>>>> >>>>>> On Thu, Mar 17, 2016 at 7:43 AM, Carlile, Ken < >>>>>> carli...@janelia.hhmi.org> wrote: >>>>>> >>>>>>> Hello, >>>>>>> >>>>>>> We have an HPC cluster that we run Spark jobs on using standalone >>>>>>> mode and a number of scripts I’ve built up to dynamically schedule and >>>>>>> start spark clusters within the Grid Engine framework. Nodes in the >>>>>>> cluster >>>>>>> have 16 cores and 128GB of RAM. >>>>>>> >>>>>>> My users use pyspark heavily. We’ve been having a number of problems >>>>>>> with nodes going offline with extraordinarily high load. I was able to >>>>>>> look >>>>>>> at one of those nodes today before it went truly sideways, and I >>>>>>> discovered >>>>>>> that the user was running 50 pyspark.daemon threads (remember, this is >>>>>>> a 16 >>>>>>> core box), and the load was somewhere around 25 or so, with all CPUs >>>>>>> maxed >>>>>>> out at 100%. >>>>>>> >>>>>>> So while the spark worker is aware it’s only got 16 cores and >>>>>>> behaves accordingly, pyspark seems to be happy to overrun everything >>>>>>> like >>>>>>> crazy. Is there a global parameter I can use to limit pyspark threads >>>>>>> to a >>>>>>> sane number, say 15 or 16? It would also be interesting to set a memory >>>>>>> limit, which leads to another question. >>>>>>> >>>>>>> How is memory managed when pyspark is used? I have the spark worker >>>>>>> memory set to 90GB, and there is 8GB of system overhead (GPFS caching), >>>>>>> so >>>>>>> if pyspark operates outside of the JVM memory pool, that leaves it at >>>>>>> most >>>>>>> 30GB to play with, assuming there is no overhead outside the JVM’s 90GB >>>>>>> heap (ha ha.) >>>>>>> >>>>>>> Thanks, >>>>>>> Ken Carlile >>>>>>> Sr. Unix Engineer >>>>>>> HHMI/Janelia Research Campus >>>>>>> 571-209-4363 >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> - >>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For >>>>>> additional commands, e-mail: user-h...@spark.apache.org >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> www.skrasser.com <http://www.skrasser.com/?utm_source=sig> >>>>> >>>>> >>>>> >>>> >>>> >>>> -- >>>> www.skrasser.com <http://www.skrasser.com/?utm_source=sig> >>>> >>>> >>>> >>> >>> >>> -- >>> www.skrasser.com <http://www.skrasser.com/?utm_source=sig> >>> >> >> > -- Best Regards Jeff Zhang
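A small sketch of two related knobs for a standalone cluster (the second one is quoted above; neither is necessarily the fix proposed in this thread, and the values are placeholders): spark.cores.max caps the total cores one application may take, and spark.python.worker.memory caps per-worker python aggregation memory before it spills to disk.

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("worker-limits-demo")
            .set("spark.cores.max", "16")                # cores this app may use (standalone/mesos)
            .set("spark.python.worker.memory", "512m"))  # python aggregation memory before spilling
    sc = SparkContext(conf=conf)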

Re: sqlcontext - not able to connect to database

2016-06-14 Thread Jeff Zhang
gt; at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:209) > at java.lang.Thread.run(Thread.java:745) > > > -- Best Regards Jeff Zhang

Re: Spark corrupts text lines

2016-06-14 Thread Jeff Zhang
spark.apache.org > > For additional commands, e-mail: user-h...@spark.apache.org > > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: java.io.FileNotFoundException

2016-06-03 Thread Jeff Zhang
c79c725/28/shuffle_8553_38_0.index >> (No such file or directory) >> >> any idea about this error ? >> -- >> Thanks, >> Kishore. >> > > > > -- > Thanks, > Kishore. > -- Best Regards Jeff Zhang

Bug of PolynomialExpansion ?

2016-05-29 Thread Jeff Zhang
2*x1,x2*x2, x2*x3, x3*x1, x3*x2,x3*x3) (3,[0,2],[1.0,1.0]) --> (9,[0,1,5,6,8],[1.0,1.0,1.0,1.0,1.0])| -- Best Regards Jeff Zhang
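For reference, a minimal sketch that reproduces the expansion above with the ML feature transformer (Spark 1.6-era imports; on 2.x use pyspark.ml.linalg instead):

    from pyspark.ml.feature import PolynomialExpansion
    from pyspark.mllib.linalg import Vectors
    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)   # assumes an existing SparkContext named sc
    df = sqlContext.createDataFrame(
        [(Vectors.sparse(3, [0, 2], [1.0, 1.0]),)], ["features"])

    px = PolynomialExpansion(degree=2, inputCol="features", outputCol="expanded")
    px.transform(df).select("expanded").show(truncate=False)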

Re: run multiple spark jobs yarn-client mode

2016-05-25 Thread Jeff Zhang
2-b247e9249061/userFiles-857fece4-83c4-441a-8d3e-2a6ae8e3193a > INFO apache.spark.util.ShutdownHookManager - Deleting directory > /tmp/spark-945fa8f4-477c-4a65-a572-b247e9249061 > > > > Sent from Yahoo Mail. Get the app <https://yho.com/148vdq> > -- Best Regards Jeff Zhang

Any way to pass custom hadoop conf to through spark thrift server ?

2016-05-18 Thread Jeff Zhang
I want to pass a custom hadoop conf to the spark thrift server so that both the driver and executor side can get it. I also want this custom hadoop conf to be visible only to the jobs of the user who set it. Is this possible with the spark thrift server now? Thanks -- Best Regards Jeff Zhang

Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Jeff Zhang
y commercial proposition or >> anything like that. As I seem to get involved with members troubleshooting >> issues and threads on this topic, I thought it is worthwhile writing a note >> about it to summarise the findings for the benefit of the community. >> >> >> Regards. >> >> >> Dr Mich Talebzadeh >> >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> >> http://talebzadehmich.wordpress.com >> >> >> > > -- Best Regards Jeff Zhang

Re: Why does spark 1.6.0 can't use jar files stored on HDFS

2016-05-17 Thread Jeff Zhang
plication main class fails with class not found exception. > Is there any workaround? > -- Best Regards Jeff Zhang

Re: Spark crashes with Filesystem recovery

2016-05-17 Thread Jeff Zhang
kError: An error occurred while trying to connect > to the Java server > > Even though I start pyspark with these options: > ./pyspark --master local[4] --executor-memory 14g --driver-memory 14g > --packages com.databricks:spark-csv_2.11:1.4.0 > --spark.deploy.recoveryMode=FILESYSTEM > > and this in my /conf/spark-env.sh file: > - SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM > -Dspark.deploy.recoveryDirectory=/user/recovery" > > How can I get HA to work in Spark? > > thanks, > imran > > -- Best Regards Jeff Zhang

Re: pandas dataframe broadcasted. giving errors in datanode function called kernel

2016-05-17 Thread Jeff Zhang
xt: > http://apache-spark-user-list.1001560.n3.nabble.com/pandas-dataframe-broadcasted-giving-errors-in-datanode-function-called-kernel-tp26953.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > ----- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: Spark 1.6.1 throws error: Did not find registered driver with class oracle.jdbc.OracleDriver

2016-04-27 Thread Jeff Zhang
w.index.stride"="1" ) > """ > HiveContext.sql(sqltext) > // > sqltext = """ > INSERT INTO TABLE test.dummy2 > SELECT > * > FROM tmp > """ > HiveContext.sql(sqltext) > > In Spark 1.6.1, it is throwing error as below > > > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 1.0 (TID 4, rhes564): java.lang.IllegalStateException: Did not find > registered driver with class oracle.jdbc.OracleDriver > > Is this a new bug introduced in Spark 1.6.1? > > > Thanks > -- Best Regards Jeff Zhang

Re: executor delay in Spark

2016-04-24 Thread Jeff Zhang
;>> > Regards, >>>>> > Raghava. >>>>> > >>>>> > On Sun, Apr 17, 2016 at 10:54 PM, Anuj Kumar >>>>> wrote: >>>>> > >>>>> >> If the data file is same then it should have similar distribution of >>>>> >> keys. >>>>> >> Few queries- >>>>> >> >>>>> >> 1. Did you compare the number of partitions in both the cases? >>>>> >> 2. Did you compare the resource allocation for Spark Shell vs Scala >>>>> >> Program being submitted? >>>>> >> >>>>> >> Also, can you please share the details of Spark Context, >>>>> Environment and >>>>> >> Executors when you run via Scala program? >>>>> >> >>>>> >> On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju < >>>>> >> m.vijayaragh...@gmail.com> wrote: >>>>> >> >>>>> >>> Hello All, >>>>> >>> >>>>> >>> We are using HashPartitioner in the following way on a 3 node >>>>> cluster (1 >>>>> >>> master and 2 worker nodes). >>>>> >>> >>>>> >>> val u = >>>>> >>> sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt").map[(Int, >>>>> >>> Int)](line => { line.split("\\|") match { case Array(x, y) => >>>>> (y.toInt, >>>>> >>> x.toInt) } }).partitionBy(new >>>>> HashPartitioner(8)).setName("u").persist() >>>>> >>> >>>>> >>> u.count() >>>>> >>> >>>>> >>> If we run this from the spark shell, the data (52 MB) is split >>>>> across >>>>> >>> the >>>>> >>> two worker nodes. But if we put this in a scala program and run >>>>> it, then >>>>> >>> all the data goes to only one node. We have run it multiple times, >>>>> but >>>>> >>> this >>>>> >>> behavior does not change. This seems strange. >>>>> >>> >>>>> >>> Is there some problem with the way we use HashPartitioner? >>>>> >>> >>>>> >>> Thanks in advance. >>>>> >>> >>>>> >>> Regards, >>>>> >>> Raghava. >>>>> >>> >>>>> >> >>>>> >> >>>>> > >>>>> > >>>>> > -- >>>>> > Regards, >>>>> > Raghava >>>>> > http://raghavam.github.io >>>>> > >>>>> >>>>> >>>>> -- >>>>> Thanks, >>>>> Mike >>>>> >>>> >>>> >>>> >>>> -- >>>> Regards, >>>> Raghava >>>> http://raghavam.github.io >>>> >>> >> >> >> -- >> Regards, >> Raghava >> http://raghavam.github.io >> > -- Best Regards Jeff Zhang

Re: Re: Re: Why Spark having OutOfMemory Exception?

2016-04-20 Thread Jeff Zhang
d, Apr 20, 2016 at 3:55 PM, 李明伟 wrote: > Hi Jeff > > The total size of my data is less than 10M. I already set the driver > memory to 4GB. > > > > > > > > On 2016-04-20 13:42:25, "Jeff Zhang" wrote: > > Seems it is OOM on the driver side when fetching

Re: Re: Why Spark having OutOfMemory Exception?

2016-04-19 Thread Jeff Zhang
rk User List mailing list archive at Nabble.com. > > > >----- > >To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional > >commands, e-mail: user-h...@spark.apache.org > > > >Information transmitted by this e-mail is proprietary to Mphasis, its > >associated companies and/ or its customers and is intended > >for use only by the individual or entity to which it is addressed, and may > >contain information that is privileged, confidential or > >exempt from disclosure under applicable law. If you are not the intended > >recipient or it appears that this mail has been forwarded > >to you without proper authority, you are notified that any use or > >dissemination of this information in any manner is strictly > >prohibited. In such cases, please notify us immediately at > >mailmas...@mphasis.com and delete this mail from your records. > > > > > > > > > > > > -- Best Regards Jeff Zhang

Re: Spark 1.6.0 - token renew failure

2016-04-13 Thread Jeff Zhang
op.proxyuser.spark.groups > * > > > > hadoop.proxyuser.spark.hosts > * > > > ... > > > hadoop.security.auth_to_local > > RULE:[1:$1@$0](spark-pantagr...@contactlab.lan)s/.*/spark/ > DEFAULT > > > > > "spark" is present as local user in all servers. > > > What does is missing here ? > > > > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: sqlContext.cacheTable + yarn client mode

2016-03-30 Thread Jeff Zhang
he local Spark driver for some SQL code and was wondering if the > local cache load could be the culprit. > > Appreciate any thoughts. BTW, we're running Spark 1.6.0 on this particular > cluster. > > Regards, > > Soam > -- Best Regards Jeff Zhang

Re: pyspark unable to convert dataframe column to a vector: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2016-03-29 Thread Jeff Zhang
legatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:249) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:327) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:226) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:229) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > > > -- Best Regards Jeff Zhang

Re: run spark job

2016-03-29 Thread Jeff Zhang
scr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
On Wed, Mar 23, 2016 at 11:58 AM, Jeff Zhang wrote: > I think I got the root cause: you can use Text.toString() to solve this > issue. Because the Text object is reused, the last record is displayed multiple > times. > > On Wed, Mar 23, 2016 at 11:37 AM, Jeff Zhang wrote: > >> L

Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
I think I got the root cause: you can use Text.toString() to solve this issue. Because the Text object is reused, the last record is displayed multiple times. On Wed, Mar 23, 2016 at 11:37 AM, Jeff Zhang wrote: > Looks like a spark bug. I can reproduce it for sequence file, but it works > for tex

Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
> > Thanks in advance for your help! > > -- > Thamme > -- Best Regards Jeff Zhang

Re: DataFrame vs RDD

2016-03-22 Thread Jeff Zhang
-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: exception while running job as pyspark

2016-03-16 Thread Jeff Zhang
; python2.7 couldn't found. But i m using vertual env python 2.7 > {{{ > [ram@test-work workspace]$ python > Python 2.7.8 (default, Mar 15 2016, 04:37:00) > [GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> > }}} > > Can anyone help me with this? > Thanks > -- Best Regards Jeff Zhang

Re: Job failed while submitting python to yarn programatically

2016-03-15 Thread Jeff Zhang
> --- > TSMC PROPERTY > This email communication (and any attachments) is proprietary information > for the sole use of its > intended recipient. Any unauthorized review, use or distribution by anyone > other than the intended > recipient is strictly prohibited. If you are not the intended recipient, > please notify the sender by > replying to this email, and then delete this email and any copies of it > immediately. Thank you. > > --- > > > -- Best Regards Jeff Zhang

Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
It's the same as the hive thrift server. I believe kerberos is supported. On Wed, Mar 16, 2016 at 10:48 AM, ayan guha wrote: > so, how about implementing security? Any pointer will be helpful > > On Wed, Mar 16, 2016 at 1:44 PM, Jeff Zhang wrote: >> The spark thrift serve
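As an illustration, a small sketch of talking to it exactly like a HiveServer2 endpoint from Python, assuming the PyHive package and the default port 10000 (host and user are placeholders):

    from pyhive import hive

    conn = hive.connect(host="thriftserver-host", port=10000, username="spark")
    cursor = conn.cursor()
    cursor.execute("SHOW TABLES")
    for row in cursor.fetchall():
        print(row)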

Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
r as a better version > of hive one (with Spark as execution engine instead of MR/Tez) OR should I > see it as a JDBC server? > > On Wed, Mar 16, 2016 at 11:44 AM, Jeff Zhang wrote: > >> spark thrift server is very similar with hive thrift server. You can use >> hive jdbc dri

Re: Spark UI Completed Jobs

2016-03-15 Thread Jeff Zhang
19841/19788 > *(41405 skipped)* > Thanks, > Prabhu Joseph > -- Best Regards Jeff Zhang

Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
ty? Can we restrict certain > users to access certain dataframes and not the others? > > -- > Best Regards, > Ayan Guha > -- Best Regards Jeff Zhang

Re: what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Jeff Zhang
wrote: > Thanks Jeff > > I was looking for something like ‘unregister’ > > > In SQL you use drop to delete a table. I always though register was a > strange function name. > > Register **-1 = unregister > createTable **-1 == dropTable > > Andy > > From: Jeff Zhang

Re: what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Jeff Zhang
>>> sqlContext.registerDataFrameAsTable(df, "table1") >>> sqlContext.dropTempTable("table1") On Wed, Mar 16, 2016 at 7:40 AM, Andy Davidson < a...@santacruzintegration.com> wrote: > Thanks > > Andy > -- Best Regards Jeff Zhang

Re: Saving multiple outputs in the same job

2016-03-09 Thread Jeff Zhang
want in Spark, which is to > > have some control over which saves go into which jobs, and then execute > the > > jobs directly. I can envision a new version of the various save functions > > which take an extra job argument, or something, or some way to defer and > > unblock job creation in the spark context. > > > > Ideas? > > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: Setting PYSPARK_PYTHON in spark-env.sh vs from driver program

2016-03-07 Thread Jeff Zhang
as > I can't find any mention of the environment of the driver program > overriding the environment in the workers, also that environment variable > was previously completely unset in the driver program anyway. > > Is there an explanation for this to help me understand how to do things > properly? We run Spark 1.6.0 on Ubuntu 14.04. > > Thanks > > Kostas > -- Best Regards Jeff Zhang

Re: OOM exception during Broadcast

2016-03-07 Thread Jeff Zhang
.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921) > > > I'm using spark 1.5.2. Cluster nodes are amazon r3.2xlarge. The spark > property maximizeResourceAllocation is set to true (executor.memory = 48G > according to spark ui environment). We're also using kryo serialization and > Yarn is the resource manager. > > Any ideas as what might be going wrong and how to debug this? > > Thanks, > Arash > > -- Best Regards Jeff Zhang

Re: Renaming sc variable in sparkcontext throws task not serializable

2016-03-02 Thread Jeff Zhang
+ temp). > > I am not sure if this is a known issue, or we should file a JIRA for it. > We originally came across this bug in the SciSpark project. > > Best, > > Rahul P > -- Best Regards Jeff Zhang

Re: Spark on Yarn with Dynamic Resource Allocation. Container always marked as failed

2016-03-02 Thread Jeff Zhang
ed: > container_1456905762620_0002_01_02 on host: bold-x.rice.edu. Exit status: > 1. Diagnostics: Exception from container-launch. > > > Is there anybody know what is the problem here? > Best, > Xiaoye > -- Best Regards Jeff Zhang

Re: Spark executor killed without apparent reason

2016-03-01 Thread Jeff Zhang
can'\t >>> seem to connect it to Above executor logs. I think executor logs should at >>> least have mentioning of executor ID/task ID (EID-TID) and not just task ID >>> (TID). >>> >>> this is snippet of driver logs from ui: >>> >>> 189 15283 0 FAILED PROCESS_LOCAL 15 / maprnode5 2016/02/24 11:08:55 / >>> ExecutorLostFailure >>> (executor 15 lost) >>> >>> here we can see executor id is 5 but executor logs itself doesn't use >>> this id as reference in log stream so it's hard to cross check logs. >>> >>> >>> Anyhow my main issue is to determine cause of executor killing. >>> >>> >>> Thanks >>> >>> Nirav >>> >>> >>> >>> >>> >>> >>> >>> [image: What's New with Xactly] <http://www.xactlycorp.com/email-click/> >>> >>> <https://www.nyse.com/quote/XNYS:XTLY> [image: LinkedIn] >>> <https://www.linkedin.com/company/xactly-corporation> [image: Twitter] >>> <https://twitter.com/Xactly> [image: Facebook] >>> <https://www.facebook.com/XactlyCorp> [image: YouTube] >>> <http://www.youtube.com/xactlycorporation> >> >> >> > > > > [image: What's New with Xactly] <http://www.xactlycorp.com/email-click/> > > <https://www.nyse.com/quote/XNYS:XTLY> [image: LinkedIn] > <https://www.linkedin.com/company/xactly-corporation> [image: Twitter] > <https://twitter.com/Xactly> [image: Facebook] > <https://www.facebook.com/XactlyCorp> [image: YouTube] > <http://www.youtube.com/xactlycorporation> > -- Best Regards Jeff Zhang

Re: Is spark.driver.maxResultSize used correctly ?

2016-03-01 Thread Jeff Zhang
n Sunday, February 28, 2016, Jeff Zhang wrote: > >> data skew might be possible, but not the common case. I think we should >> design for the common case, for the skew case, we may can set some >> parameter of fraction to allow user to tune it. >> >> On Sat, Feb

Re: Converting array to DF

2016-03-01 Thread Jeff Zhang
ring, Int)] > weights.toDF("weights","value") > > I want to label columns and print out the contents in value order please I > don't know why I am getting this error > > Thanks > > -- Best Regards Jeff Zhang

Re: Support virtualenv in PySpark

2016-03-01 Thread Jeff Zhang
ady do what you proposed by creating > identical virtualenvs on all nodes on the same path and change the spark > python path to point to the virtualenv. > > Best Regards, > Mohannad > On Mar 1, 2016 06:07, "Jeff Zhang" wrote: > >> I have created jira for this f

Re: Save DataFrame to Hive Table

2016-02-29 Thread Jeff Zhang
t$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at > org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSu

Support virtualenv in PySpark

2016-02-29 Thread Jeff Zhang
- spark.pyspark.virtualenv.path (path to the executable for virtualenv/conda) Best Regards Jeff Zhang
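For illustration, a sketch of how these settings could be passed programmatically; the conf names follow this thread and the JIRA it mentions and may differ in the final implementation, and the paths are placeholders:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("virtualenv-demo")
            .set("spark.pyspark.virtualenv.enabled", "true")         # proposed switch
            .set("spark.pyspark.virtualenv.type", "native")          # or "conda"
            .set("spark.pyspark.virtualenv.requirements",
                 "/path/to/requirements.txt")                        # placeholder
            .set("spark.pyspark.virtualenv.path", "/usr/bin/virtualenv"))
    sc = SparkContext(conf=conf)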

Re: Is spark.driver.maxResultSize used correctly ?

2016-02-28 Thread Jeff Zhang
the result data are in > one or a few tasks though. > > > On Friday, February 26, 2016, Jeff Zhang wrote: > >> >> My job get this exception very easily even when I set large value of >> spark.driver.maxResultSize. After checking the spark code, I found >>

Re: PySpark : couldn't pickle object of type class T

2016-02-28 Thread Jeff Zhang
pes. > > Thanks, > AnoopShiralige > > > On Thu, Feb 25, 2016 at 7:25 AM, Jeff Zhang wrote: > >> Avro Record is not supported by pickler, you need to create a custom >> pickler for it. But I don't think it worth to do that. Actually you can >> use package

Re: Dynamic allocation Spark

2016-02-26 Thread Jeff Zhang
> http://apache-spark-user-list.1001560.n3.nabble.com/Dynamic-allocation-Spark-tp26344.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: No event log in /tmp/spark-events

2016-02-26 Thread Jeff Zhang
ark-events-tp26318p26343.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Is spark.driver.maxResultSize used correctly ?

2016-02-26 Thread Jeff Zhang
asks (1085.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB) -- Best Regards Jeff Zhang

Re: When I merge some datas,can't go on...

2016-02-26 Thread Jeff Zhang
tp://apache-spark-user-list.1001560.n3.nabble.com/When-I-merge-some-datas-can-t-go-on-tp26341.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: PySpark : couldn't pickle object of type class T

2016-02-24 Thread Jeff Zhang
age in context: > http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-couldn-t-pickle-object-of-type-class-T-tp26204.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: which master option to view current running job in Spark UI

2016-02-23 Thread Jeff Zhang
ee the currently running jobs in Spark WEB UI >> though it later appear in spark history server. >> >> My question with which --master option should I run my spark jobs so that >> I can view the currently running jobs in spark web UI . >> >> Thanks, >> Divya >> > -- Best Regards Jeff Zhang

Re: Joining three tables with data frames

2016-02-13 Thread Jeff Zhang
n data frame for each table rather than importing the whole > columns. > > > > Thanks, > > > > > > Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > NOTE: The information in this email is proprietary and confidential. This > message is for the designated recipient only, if you are not the intended > recipient, you should destroy it immediately. Any information in this > message shall not be understood as given or endorsed by Peridale Technology > Ltd, its subsidiaries or their employees, unless expressly so stated. It is > the responsibility of the recipient to ensure that this email is virus > free, therefore neither Peridale Technology Ltd, its subsidiaries nor their > employees accept any responsibility. > > > > > -- Best Regards Jeff Zhang
