roperty which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
--
Best Regards
Jeff Zhang
You can first try it via docker
http://zeppelin.apache.org/download.html#using-the-official-docker-image
Jeff Zhang wrote on Mon, Sep 27, 2021 at 6:49 AM:
> Hi kumar,
>
> You can try Zeppelin, which supports UDF sharing across languages
>
> http://zeppelin.apache.org/
>
>
>
>
>
2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py
> in (.0)
>1264
>1265 args_command = "".join(
> -> 1266 [get_command_part(arg, self.pool) for arg in new_args])
>1267
>1268 return args_command, temp_args
>
> ~/.sdkman/candidates/spark/3.0.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py
> in get_command_part(parameter, python_proxy_pool)
> 296 command_part += ";" + interface
> 297 else:
> --> 298 command_part = REFERENCE_TYPE + parameter._get_object_id()
> 299
> 300 command_part += "\n"
>
>
>
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
--
Best Regards
Jeff Zhang
/interpreter/spark.html
Download it here: https://zeppelin.apache.org/download.html
--
Best Regards
Jeff Zhang
Twitter: zjffdu
t created, often timing out.
>
> Any ideas on how to resolve this ?
> Any other alternatives to databricks notebook ?
>
>
--
Best Regards
Jeff Zhang
/api/python/user_guide/arrow_pandas.html#recommended-pandas-and-pyarrow-versions
--
Best Regards
Jeff Zhang
c 2018)
>
> --
> *From:* Jeff Zhang
> *Sent:* Thursday, December 26, 2019 5:36:50 PM
> *To:* Felix Cheung
> *Cc:* user.spark
> *Subject:* Re: Fail to use SparkR of 3.0 preview 2
>
> I use R 3.5.2
>
> Felix Cheung wrote on Fri, Dec 27, 2019 at 4:32 AM:
>
> I
I use R 3.5.2
Felix Cheung wrote on Fri, Dec 27, 2019 at 4:32 AM:
> It looks like a change in the method signature in R base packages.
>
> Which version of R are you running on?
>
> --
> *From:* Jeff Zhang
> *Sent:* Thursday, December 26, 2019 12:46:12
ods")):
number of columns of matrices must match (see arg 2)
During startup - Warning messages:
1: package ‘SparkR’ was built under R version 3.6.2
2: package ‘SparkR’ in options("defaultPackages") was not found
Does anyone know what might be wrong ? Thanks
--
Best Regards
Jeff Zhang
g, but there must be
> something wrong with my setup. I don't understand the code of the
> ApplicationMaster, so could somebody explain to me what it is trying to reach?
> Where exactly does the connection timeout? So at least I can debug it
> further because I don't have a clue what it is doing :-)
>
> Thanks for any help!
> Jochen
>
--
Best Regards
Jeff Zhang
rn cluster mode. Could I use yarn
> without starting DFS? How could I use this mode?
>
> Yours,
> Jane
>
--
Best Regards
Jeff Zhang
ache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
--
Best Regards
Jeff Zhang
bers for contributing to
> this release. This release would not have been possible without you.
>
> Bests,
> Dongjoon.
>
--
Best Regards
Jeff Zhang
Make sure you use the correct Python, i.e. one that has tensorframes installed.
Use PYSPARK_PYTHON
to configure the Python interpreter.
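For illustration, a minimal sketch of doing this in a plain PySpark script; the interpreter path is a placeholder and must point at the environment that actually has tensorframes. In Zeppelin the same PYSPARK_PYTHON value would go into the Spark interpreter settings.

import os
from pyspark.sql import SparkSession

# Placeholder path: a Python environment that has tensorframes installed.
# PYSPARK_PYTHON must be set before the SparkContext (and its JVM) starts,
# so that executors launch their Python workers from this interpreter.
os.environ["PYSPARK_PYTHON"] = "/opt/envs/tf/bin/python"

spark = SparkSession.builder.appName("tensorframes-check").getOrCreate()
print(spark.sparkContext.pythonExec)  # which interpreter the workers will use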
Spico Florin wrote on Wed, Aug 8, 2018 at 9:59 PM:
> Hi!
>
> I would like to use tensorframes in my pyspark notebook.
>
> I have performed the following:
>
> 1. In the Spark interpreter added a new
Check the yarn AM log for details.
Aakash Basu wrote on Fri, Jun 8, 2018 at 4:36 PM:
> Hi,
>
> Getting this error when trying to run Spark Shell using YARN -
>
> Command: *spark-shell --master yarn --deploy-mode client*
>
> 2018-06-08 13:39:09 WARN Client:66 - Neither spark.yarn.jars nor
>
Could you check the Spark app's YARN log and the Livy log?
Chetan Khatri wrote on Thu, May 10, 2018 at 4:18 AM:
> All,
>
> I am running on Hortonworks HDP Hadoop with Livy and Spark 2.2.0, when I
> am running same spark job using spark-submit it is getting success with all
>
It supports Python 3.5, and IIRC, Spark also supports Python 3.6.
Irving Duran wrote on Thu, May 10, 2018 at 9:08 PM:
> Does Spark now support Python 3.5, or is it just 3.4.x?
>
> https://spark.apache.org/docs/latest/rdd-programming-guide.html
>
> Thank You,
>
> Irving Duran
>
I don't think it is possible to have less than 1 core for the AM; this is due
to YARN, not Spark.
The AM's resource usage compared to that of the executors should be small and
acceptable. If you do want to save more resources, I would suggest you
use yarn-cluster mode, where the driver and AM run in the
Hi Manuel,
Looks like you are using Spark's virtualenv support. Virtualenv will create a
Python environment on each executor.
>>> --conf
>>> spark.pyspark.virtualenv.bin.path=/home/mansop/hail-test/python-2.7.2/bin/activate
\
And the configuration is not correct: spark.pyspark.virtualenv.bin.path
Pig supports the Spark engine now, so you can leverage Spark execution with a
Pig script.
I am afraid there is no solution to convert a Pig script into Spark API code.
Pralabh Kumar wrote on Mon, Jan 8, 2018 at 11:25 PM:
> Hi
>
> Is there a convenient way /open source project to convert PIG
You are setting PYSPARK_DRIVER_PYTHON to jupyter; please set it to the Python executable instead.
anudeep wrote on Fri, Nov 3, 2017 at 7:31 PM:
> Hello experts,
>
> I installed jupyter notebook through anaconda and set the pyspark driver to use
> jupyter notebook.
>
> I see the below issue when I try to open
Awesome !
Hyukjin Kwon wrote on Thu, Jul 13, 2017 at 8:48 AM:
> Cool!
>
> 2017-07-13 9:43 GMT+09:00 Denny Lee :
>
>> This is amazingly awesome! :)
>>
>> On Wed, Jul 12, 2017 at 13:23 lucas.g...@gmail.com
>> wrote:
>>
>>> That's great!
>>>
>>>
It seems to be caused by your log4j file:
*Caused by: java.lang.IllegalStateException: FileNamePattern [-.log]
does not contain a valid date format specifier*
wrote on Thu, Apr 6, 2017 at 4:03 PM:
> Hi All ,
>
>
>
> I am just trying to use ScalaTest for testing a small piece of Spark code.
It is fixed in https://issues.apache.org/jira/browse/SPARK-13330
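For builds that predate that fix, a commonly used workaround (a hedged sketch, not taken from this thread) is to pin the hash seed on the executors via Spark's executor environment settings:

from pyspark.sql import SparkSession

# Pin PYTHONHASHSEED on the executors so Python string hashing is
# deterministic across worker processes (needed by operations that
# partition by hashed keys, e.g. reduceByKey on strings, under Python 3).
spark = (SparkSession.builder
         .config("spark.executorEnv.PYTHONHASHSEED", "0")
         .getOrCreate())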
Holden Karau wrote on Wed, Apr 5, 2017 at 12:03 AM:
> Which version of Spark is this (or is it a dev build)? We've recently made
> some improvements with PYTHONHASHSEED propagation.
>
> On Tue, Apr 4, 2017 at 7:49 AM Eike
Or you can use Livy to submit Spark jobs:
http://livy.io/
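A rough sketch of what a Livy batch submission looks like over its REST API (the host, port, jar path, and class name below are placeholders):

import json
import requests

livy_url = "http://livy-host:8998/batches"        # placeholder host/port
payload = {
    "file": "hdfs:///jobs/my-spark-job.jar",      # placeholder application jar
    "className": "com.example.MySparkJob",        # placeholder main class
    "args": ["2016-12-26"],
}

resp = requests.post(livy_url, data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
print(resp.json())  # Livy replies with the batch id and its state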
Linyuxin wrote on Mon, Dec 26, 2016 at 10:32 AM:
> Thanks.
>
>
>
> *From:* Naveen [mailto:hadoopst...@gmail.com]
> *Sent:* December 25, 2016 0:33
> *To:* Linyuxin
> *Cc:* user
> *Subject:* Re:
In your sample code, you can use hiveContext in the foreach because it is a Scala
List foreach operation, which runs on the driver side. But you cannot use
hiveContext inside RDD.foreach.
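The thread is about Scala, but the same rule holds in PySpark; a hedged sketch (table names are made up) of the allowed and disallowed patterns:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hive_ctx = HiveContext(sc)
tables = ["db.t1", "db.t2"]  # placeholder table names

# Fine: an ordinary driver-side loop, so hive_ctx is only used on the driver.
for t in tables:
    hive_ctx.sql("SELECT COUNT(*) FROM {}".format(t)).show()

# Not fine: RDD.foreach runs on the executors, where the driver-side
# context does not exist and cannot be shipped, so this fails at runtime.
# sc.parallelize(tables).foreach(
#     lambda t: hive_ctx.sql("SELECT COUNT(*) FROM {}".format(t)))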
Ajay Chander wrote on Wed, Oct 26, 2016 at 11:28 AM:
> Hi Everyone,
>
> I was thinking if I can use hiveContext
arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
--
Best Regards
Jeff Zhang
ngConstructorAccessorImpl.java:45)
>
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>
> at org.apache.hive.service.cli.HiveSQLException.newInstance(
> HiveSQLException.java:244)
>
> at org.apache.hive.service.cli.HiveSQLException.toStackTrace(
> HiveSQLException.java:210)
>
> ... 15 more
>
> Error: Error retrieving next row (state=,code=0)
>
>
>
> The same command works when using Spark 1.6, is it a possible issue?
>
>
>
> Thanks!
>
--
Best Regards
Jeff Zhang
-bin-hadoop2.6/bin/spark-submit
> export YARN_CONF_DIR=/etc/hadoop/conf
> export HADOOP_CONF_DIR=/etc/hadoop/conf
> export SPARK_HOME=/etc/spark-2.0.0-bin-hadoop2.6
>
>
> how do I update?
>
>
>
>
>
> ===
> Name: cen sujun
> Mobile: 13067874572
> Mail: ce...@lotuseed.com
>
>
--
Best Regards
Jeff Zhang
dir-to-HDFS-tp27291.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -----
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
--
Best Regards
Jeff Zhang
the feeling,
> that I'm trying some very rare case?
>
> 2016-07-01 10:54 GMT-07:00 Jeff Zhang <zjf...@gmail.com>:
>
>> This is not a bug, because these 2 processes use the same SPARK_PID_DIR
>> which is /tmp by default. Although you can resolve this by using
>&
v.e...@gmail.com>
wrote:
> I get
>
> "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as
> process 28989. Stop it first."
>
> Is it a bug?
>
> 2016-07-01 10:10 GMT-07:00 Jeff Zhang <zjf...@gmail.com>:
>
>> I don't think the one i
same process as a
> server, so it makes some sense, but it's really inconvenient - I need a lot
> of memory on my driver machine. Reasons for one instance per machine I do
> not understand.
>
> --
>
>
> *Sincerely yoursEgor Pakhomov*
>
--
Best Regards
Jeff Zhang
gt; at
> scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:968)
>
> at
> scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
>
> at
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> at
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
>
> at
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>
> at
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
>
> at
> org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
>
> at
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
>
>at
> org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
> BR
>
>
>
> Joaquin
> This email is confidential and may be subject to privilege. If you are not
> the intended recipient, please do not copy or disclose its content but
> contact the sender immediately upon receipt.
>
--
Best Regards
Jeff Zhang
'writeUTF']
>
> The next thing I would run into is converting the JVM RDD[String] back to
> a Python RDD, what is the easiest way to do this?
>
> Overall, is this a good approach to calling the same API in Scala and
> Python?
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>
--
Best Regards
Jeff Zhang
JettyServer(JettyUtils.scala:262)
>> at org.apache.spark.ui.WebUI.bind(WebUI.scala:137)
>> at
>> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
>> at
>> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
>> at scala.Option.foreach(Option.scala:236)
>> at org.apache.spark.SparkContext.(SparkContext.scala:481)
>> at
>> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59)
>>
>>
>>
>
--
Best Regards
Jeff Zhang
ntBatch
> def isErrorBatch(batch: EventBatch): Boolean = {
> ^
>
> /git/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkSink.scala
> Error:(86, 51) not found: type SparkFlumeProtocol
> val responder = new
us library references that - although included in the
> pom.xml build - are for some reason not found when processed within
> Intellij.
>
--
Best Regards
Jeff Zhang
examples.SparkPi
>> --master yarn-client --driver-memory 512m --num-executors 2
>> --executor-memory 512m --executor-cores 210:
>>
>>
>>
>>- Error: Could not find or load main class
>>org.apache.spark.deploy.yarn.ExecutorLauncher
>>
>> but I didn't configure that parameter, and there was no error. Why??? Is that parameter only
>> to avoid uploading the resource file (jar package)??
>>
>
>
--
Best Regards
Jeff Zhang
thoughts on how to track down
> what is happening here?
>
> Thanks!
>
> Pierre.
>
--
Best Regards
Jeff Zhang
Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -----
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
Best Regards
Jeff Zhang
e last 50
> tasks in my example can be killed (timed out) and the stage completes
> successfully.
>
> --
> Thanks,
> -Utkarsh
>
--
Best Regards
Jeff Zhang
. Won't I
> be able to do this processing in local mode then?
>
> Regards,
> Tejaswini
>
> On Wed, Jun 15, 2016 at 6:32 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> You are using local mode, --executor-memory won't take effect for local
>> mode, please use other
You are using local mode; --executor-memory won't take effect in local
mode. Please use a cluster mode instead.
On Thu, Jun 16, 2016 at 9:32 AM, Jeff Zhang <zjf...@gmail.com> wrote:
> Specify --executor-memory in your spark-submit command.
>
>
>
> On Thu, Jun 16, 2016 at 9:
Specify --executor-memory in your spark-submit command.
On Thu, Jun 16, 2016 at 9:01 AM, spR <data.smar...@gmail.com> wrote:
> Thank you. Can you please tell me how to increase the executor memory?
>
>
>
> On Wed, Jun 15, 2016 at 5:59 PM, Jeff Zhang <zjf...@gmail.com
k.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.ap
eption. But I have 16 gb of ram
>
> Then my notebook dies and I get below error
>
> Py4JNetworkError: An error occurred while trying to connect to the Java server
>
>
> Thank You
>
--
Best Regards
Jeff Zhang
gt;>>>>> Though I didn't find answer for your first question, I think the
>>>>>> following pertains to your second question:
>>>>>>
>>>>>>
>>>>>> spark.python.worker.memory
>>>>>> 512m
>>>>>>
>>>>>> Amount of memory to use per python worker process during
>>>>>> aggregation, in the same
>>>>>> format as JVM memory strings (e.g. 512m,
>>>>>> 2g). If the memory
>>>>>> used during aggregation goes above this amount, it will spill the
>>>>>> data into disks.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 17, 2016 at 7:43 AM, Carlile, Ken <
>>>>>> carli...@janelia.hhmi.org> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> We have an HPC cluster that we run Spark jobs on using standalone
>>>>>>> mode and a number of scripts I’ve built up to dynamically schedule and
>>>>>>> start spark clusters within the Grid Engine framework. Nodes in the
>>>>>>> cluster
>>>>>>> have 16 cores and 128GB of RAM.
>>>>>>>
>>>>>>> My users use pyspark heavily. We’ve been having a number of problems
>>>>>>> with nodes going offline with extraordinarily high load. I was able to
>>>>>>> look
>>>>>>> at one of those nodes today before it went truly sideways, and I
>>>>>>> discovered
>>>>>>> that the user was running 50 pyspark.daemon threads (remember, this is
>>>>>>> a 16
>>>>>>> core box), and the load was somewhere around 25 or so, with all CPUs
>>>>>>> maxed
>>>>>>> out at 100%.
>>>>>>>
>>>>>>> So while the spark worker is aware it’s only got 16 cores and
>>>>>>> behaves accordingly, pyspark seems to be happy to overrun everything
>>>>>>> like
>>>>>>> crazy. Is there a global parameter I can use to limit pyspark threads
>>>>>>> to a
>>>>>>> sane number, say 15 or 16? It would also be interesting to set a memory
>>>>>>> limit, which leads to another question.
>>>>>>>
>>>>>>> How is memory managed when pyspark is used? I have the spark worker
>>>>>>> memory set to 90GB, and there is 8GB of system overhead (GPFS caching),
>>>>>>> so
>>>>>>> if pyspark operates outside of the JVM memory pool, that leaves it at
>>>>>>> most
>>>>>>> 30GB to play with, assuming there is no overhead outside the JVM’s 90GB
>>>>>>> heap (ha ha.)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ken Carlile
>>>>>>> Sr. Unix Engineer
>>>>>>> HHMI/Janelia Research Campus
>>>>>>> 571-209-4363
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -
>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
>>>>>> additional commands, e-mail: user-h...@spark.apache.org
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> www.skrasser.com <http://www.skrasser.com/?utm_source=sig>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> www.skrasser.com <http://www.skrasser.com/?utm_source=sig>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> www.skrasser.com <http://www.skrasser.com/?utm_source=sig>
>>>
>>
>>
>
--
Best Regards
Jeff Zhang
gt; at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
> at py4j.Gateway.invoke(Gateway.java:259)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:209)
> at java.lang.Thread.run(Thread.java:745)
>
>
>
--
Best Regards
Jeff Zhang
bscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
Best Regards
Jeff Zhang
/application_1463194314221_211370/spark-3cc37dc7-fa3c-4b98-aa60-0acdfc79c725/28/shuffle_8553_38_0.index
>> (No such file or directory)
>>
>> any idea about this error ?
>> --
>> Thanks,
>> Kishore.
>>
>
>
>
> --
> Thanks,
> Kishore.
>
--
Best Regards
Jeff Zhang
,
x2*x3, x3*x1, x3*x2,x3*x3)
(3,[0,2],[1.0,1.0]) -->
(9,[0,1,5,6,8],[1.0,1.0,1.0,1.0,1.0])|
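For what it's worth, the output vector shown above is what Spark ML's PolynomialExpansion produces at degree 2, where the expanded terms are ordered x1, x1*x1, x2, x1*x2, x2*x2, x3, x1*x3, x2*x3, x3*x3 rather than the full 3x3 outer product. A small PySpark sketch (assuming the Spark 2.x ML API) that reproduces the mapping:

from pyspark.ml.feature import PolynomialExpansion
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.sparse(3, [0, 2], [1.0, 1.0]),)], ["features"])

# Degree-2 expansion of a 3-dimensional vector yields 9 output terms.
px = PolynomialExpansion(degree=2, inputCol="features", outputCol="expanded")
px.transform(df).select("expanded").show(truncate=False)
# Expected: (9,[0,1,5,6,8],[1.0,1.0,1.0,1.0,1.0])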
--
Best Regards
Jeff Zhang
p/spark-945fa8f4-477c-4a65-a572-b247e9249061/userFiles-857fece4-83c4-441a-8d3e-2a6ae8e3193a
> INFO apache.spark.util.ShutdownHookManager - Deleting directory
> /tmp/spark-945fa8f4-477c-4a65-a572-b247e9249061
>
>
>
> Sent from Yahoo Mail. Get the app <https://yho.com/148vdq>
>
--
Best Regards
Jeff Zhang
I want to pass a custom Hadoop conf to the Spark thrift server so that both
the driver and executor side can get this conf. And I also want this custom
Hadoop conf to be visible only to the job of the user who set it. Is that
possible with the Spark thrift server now? Thanks
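For context on the propagation half of the question: in a regular Spark application, any property prefixed with spark.hadoop. is copied into the Hadoop Configuration on both the driver and the executors. A hedged sketch with a made-up key (whether the thrift server can scope such a conf per user session is exactly what is being asked here):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.hadoop.my.custom.key", "my-value")  # made-up key/value
         .getOrCreate())

# The spark.hadoop. prefix is stripped and the property becomes visible in
# the Hadoop Configuration used by the driver (and shipped to executors).
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("my.custom.key"))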
--
Best Regards
Jeff Zhang
e findings for the benefit of the community.
>>
>>
>> Regards.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn *
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>
>
--
Best Regards
Jeff Zhang
>
> and then spark application main class fails with class not found exception.
> Is there any workaround?
>
--
Best Regards
Jeff Zhang
t
> py4j.protocol.Py4JNetworkError: An error occurred while trying to connect
> to the Java server
>
> Even though I start pyspark with these options:
> ./pyspark --master local[4] --executor-memory 14g --driver-memory 14g
> --packages com.databricks:spark-csv_2.11:1.4.0
> --spark.deploy.recoveryMode=FILESYSTEM
>
> and this in my /conf/spark-env.sh file:
> - SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM
> -Dspark.deploy.recoveryDirectory=/user/recovery"
>
> How can I get HA to work in Spark?
>
> thanks,
> imran
>
>
--
Best Regards
Jeff Zhang
-
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/pandas-dataframe-broadcasted-giving-errors-in-datanode-function-called-kernel-tp26953.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -----
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
Best Regards
Jeff Zhang
ot;="268435456",
> "orc.row.index.stride"="1" )
> """
> HiveContext.sql(sqltext)
> //
> sqltext = """
> INSERT INTO TABLE test.dummy2
> SELECT
> *
> FROM tmp
> """
> HiveContext.sql(sqltext)
>
> In Spark 1.6.1, it is throwing error as below
>
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 1.0 (TID 4, rhes564): java.lang.IllegalStateException: Did not find
> registered driver with class oracle.jdbc.OracleDriver
>
> Is this a new bug introduced in Spark 1.6.1?
>
>
> Thanks
>
--
Best Regards
Jeff Zhang
>
>>>>> >> If the data file is same then it should have similar distribution of
>>>>> >> keys.
>>>>> >> Few queries-
>>>>> >>
>>>>> >> 1. Did you compare the number of partitions in both the cases?
>>>>> >> 2. Did you compare the resource allocation for Spark Shell vs Scala
>>>>> >> Program being submitted?
>>>>> >>
>>>>> >> Also, can you please share the details of Spark Context,
>>>>> Environment and
>>>>> >> Executors when you run via Scala program?
>>>>> >>
>>>>> >> On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju <
>>>>> >> m.vijayaragh...@gmail.com> wrote:
>>>>> >>
>>>>> >>> Hello All,
>>>>> >>>
>>>>> >>> We are using HashPartitioner in the following way on a 3 node
>>>>> cluster (1
>>>>> >>> master and 2 worker nodes).
>>>>> >>>
>>>>> >>> val u =
>>>>> >>> sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt").map[(Int,
>>>>> >>> Int)](line => { line.split("\\|") match { case Array(x, y) =>
>>>>> (y.toInt,
>>>>> >>> x.toInt) } }).partitionBy(new
>>>>> HashPartitioner(8)).setName("u").persist()
>>>>> >>>
>>>>> >>> u.count()
>>>>> >>>
>>>>> >>> If we run this from the spark shell, the data (52 MB) is split
>>>>> across
>>>>> >>> the
>>>>> >>> two worker nodes. But if we put this in a scala program and run
>>>>> it, then
>>>>> >>> all the data goes to only one node. We have run it multiple times,
>>>>> but
>>>>> >>> this
>>>>> >>> behavior does not change. This seems strange.
>>>>> >>>
>>>>> >>> Is there some problem with the way we use HashPartitioner?
>>>>> >>>
>>>>> >>> Thanks in advance.
>>>>> >>>
>>>>> >>> Regards,
>>>>> >>> Raghava.
>>>>> >>>
>>>>> >>
>>>>> >>
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Regards,
>>>>> > Raghava
>>>>> > http://raghavam.github.io
>>>>> >
>>>>>
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Mike
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Raghava
>>>> http://raghavam.github.io
>>>>
>>>
>>
>>
>> --
>> Regards,
>> Raghava
>> http://raghavam.github.io
>>
>
--
Best Regards
Jeff Zhang
d, Apr 20, 2016 at 3:55 PM, 李明伟 <kramer2...@126.com> wrote:
> Hi Jeff
>
> The total size of my data is less than 10M. I already set the driver
> memory to 4GB.
>
>
>
>
>
>
>
> On 2016-04-20 13:42:25, "Jeff Zhang" <zjf...@gmail.com> wrote:
>
&
;http://apache-spark-user-list.1001560.n3.nabble.com/Why-Spark-having-OutOfMemory-Exception-tp26743.html
> >Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> >-----
> >To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
> >commands, e-mail: user-h...@spark.apache.org
> >
> >Information transmitted by this e-mail is proprietary to Mphasis, its
> >associated companies and/ or its customers and is intended
> >for use only by the individual or entity to which it is addressed, and may
> >contain information that is privileged, confidential or
> >exempt from disclosure under applicable law. If you are not the intended
> >recipient or it appears that this mail has been forwarded
> >to you without proper authority, you are notified that any use or
> >dissemination of this information in any manner is strictly
> >prohibited. In such cases, please notify us immediately at
> >mailmas...@mphasis.com and delete this mail from your records.
> >
>
>
>
>
>
>
>
>
>
>
--
Best Regards
Jeff Zhang
>
> hadoop.proxyuser.spark.groups = *
>
> hadoop.proxyuser.spark.hosts = *
>
> ...
>
> hadoop.security.auth_to_local =
>   RULE:[1:$1@$0](spark-pantagr...@contactlab.lan)s/.*/spark/
>   DEFAULT
>
>
>
>
> "spark" is present as local user in all servers.
>
>
> What does is missing here ?
>
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
Best Regards
Jeff Zhang
ing an OOM
> issue on the local Spark driver for some SQL code and was wondering if the
> local cache load could be the culprit.
>
> Appreciate any thoughts. BTW, we're running Spark 1.6.0 on this particular
> cluster.
>
> Regards,
>
> Soam
>
--
Best Regards
Jeff Zhang
tance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:249)
> at
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:327)
> at
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
> at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
> at
> org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:226)
> at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:229)
> at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>
>
>
--
Best Regards
Jeff Zhang
e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
Best Regards
Jeff Zhang
Zhang <zjf...@gmail.com> wrote:
> I think I got the root cause, you can use Text.toString() to solve this
> issue. Because the Text is shared so the last record display multiple
> times.
>
> On Wed, Mar 23, 2016 at 11:37 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>
&g
I think I got the root cause: you can use Text.toString() to solve this
issue. Because the Text object is reused, the last record is displayed multiple
times.
On Wed, Mar 23, 2016 at 11:37 AM, Jeff Zhang <zjf...@gmail.com> wrote:
> Looks like a spark bug. I can reproduce it for sequ
rm that it is a bug.
>
> Thanks in advance for your help!
>
> --
> Thamme
>
--
Best Regards
Jeff Zhang
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
Best Regards
Jeff Zhang
. 36 more
>
> }}}
>
> python2.7 couldn't be found. But I am using virtualenv python 2.7
> {{{
> [ram@test-work workspace]$ python
> Python 2.7.8 (default, Mar 15 2016, 04:37:00)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>>
> }}}
>
> Can anyone help me with this?
> Thanks
>
--
Best Regards
Jeff Zhang
88 Ext: 734-6325
>
> ---
> TSMC PROPERTY
> This email communication (and any attachments) is proprietary information
> for the sole use of its
> intended recipient. Any unauthorized review, use or distribution by anyone
> other than the intended
> recipient is strictly prohibited. If you are not the intended recipient,
> please notify the sender by
> replying to this email, and then delete this email and any copies of it
> immediately. Thank you.
>
> ---
>
>
>
--
Best Regards
Jeff Zhang
It's the same as the Hive thrift server. I believe Kerberos is supported.
On Wed, Mar 16, 2016 at 10:48 AM, ayan guha <guha.a...@gmail.com> wrote:
> so, how about implementing security? Any pointer will be helpful
>
> On Wed, Mar 16, 2016 at 1:44 PM, Jeff Zhang <zjf..
I view spark thriftserver as a better version
> of hive one (with Spark as execution engine instead of MR/Tez) OR should I
> see it as a JDBC server?
>
> On Wed, Mar 16, 2016 at 11:44 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> spark thrift server is very similar to hive
DBC/Thrift supports security? Can we restrict certain
> users to access certain dataframes and not the others?
>
> --
> Best Regards,
> Ayan Guha
>
--
Best Regards
Jeff Zhang
wrote:
> Thanks Jeff
>
> I was looking for something like ‘unregister’
>
>
> In SQL you use drop to delete a table. I always thought register was a
> strange function name.
>
> Register **-1 = unregister
> createTable **-1 == dropTable
>
> Andy
>
> From: Jeff Zh
>>> sqlContext.registerDataFrameAsTable(df, "table1")
>>> sqlContext.dropTempTable("table1")
On Wed, Mar 16, 2016 at 7:40 AM, Andy Davidson <
a...@santacruzintegration.com> wrote:
> Thanks
>
> Andy
>
--
Best Regards
Jeff Zhang
, which is to
> > have some control over which saves go into which jobs, and then execute
> the
> > jobs directly. I can envision a new version of the various save functions
> > which take an extra job argument, or something, or some way to defer and
> > unblock job creation in the spark context.
> >
> > Ideas?
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
Best Regards
Jeff Zhang
eResourceAllocation is set to true (executor.memory = 48G
> according to spark ui environment). We're also using kryo serialization and
> Yarn is the resource manager.
>
> Any ideas as what might be going wrong and how to debug this?
>
> Thanks,
> Arash
>
>
--
Best Regards
Jeff Zhang
ize(0 to 100).map(p => p + temp).
>
> I am not sure if this is a known issue, or we should file a JIRA for it.
> We originally came across this bug in the SciSpark project.
>
> Best,
>
> Rahul P
>
--
Best Regards
Jeff Zhang
marked as failed:
> container_1456905762620_0002_01_02 on host: bold-x.rice.edu. Exit status:
> 1. Diagnostics: Exception from container-launch.
>
>
> Does anybody know what the problem is here?
> Best,
> Xiaoye
>
--
Best Regards
Jeff Zhang
D PROCESS_LOCAL 15 / maprnode5 2016/02/24 11:08:55 /
>>> ExecutorLostFailure
>>> (executor 15 lost)
>>>
>>> here we can see executor id is 5 but executor logs itself doesn't use
>>> this id as reference in log stream so it's hard to cross check logs.
>>>
>>>
>>> Anyhow my main issue is to determine cause of executor killing.
>>>
>>>
>>> Thanks
>>>
>>> Nirav
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>
>
--
Best Regards
Jeff Zhang
the driver side?
>
>
> On Sunday, February 28, 2016, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> data skew might be possible, but not the common case. I think we should
>> design for the common case; for the skew case, we may be able to set some
>> fraction parameter to allow
toDF is not a member of Array[(String, Int)]
> weights.toDF("weights","value")
>
> I want to label the columns and print out the contents in value order. Please help; I
> don't know why I am getting this error
>
> Thanks
>
>
--
Best Regards
Jeff Zhang
you can already do what you proposed by creating
> identical virtualenvs on all nodes at the same path and changing the spark
> python path to point to the virtualenv.
>
> Best Regards,
> Mohannad
> On Mar 1, 2016 06:07, "Jeff Zhang" <zjf...@gmail.com> wrote:
>
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
Best Regards
Jeff Zhang
.virtualenv.path (path to the executable for
virtualenv/conda)
Best Regards
Jeff Zhang
have skew and almost all the result data are in
> one or a few tasks though.
>
>
> On Friday, February 26, 2016, Jeff Zhang <zjf...@gmail.com> wrote:
>
>>
>> My job gets this exception very easily even when I set a large value of
>> spark.driver.maxRe
the
> union of multiple data-types.
>
> Thanks,
> AnoopShiralige
>
>
> On Thu, Feb 25, 2016 at 7:25 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> Avro Record is not supported by pickler, you need to create a custom
>> pickler for it. But I don't think it
ssage in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Dynamic-allocation-Spark-tp26344.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
Best Regards
Jeff Zhang
s, e-mail: user-h...@spark.apache.org
>
>
--
Best Regards
Jeff Zhang
(1085.0 MB)
is bigger than spark.driver.maxResultSize (1024.0 MB)
--
Best Regards
Jeff Zhang
ext:
> http://apache-spark-user-list.1001560.n3.nabble.com/When-I-merge-some-datas-can-t-go-on-tp26341.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: us
--
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-couldn-t-pickle-object-of-type-class-T-tp26204.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
Best Regards
Jeff Zhang
th this option I can't see the currently running jobs in the Spark web UI,
>> though it later appears in the Spark history server.
>>
>> My question is: with which --master option should I run my spark jobs so that
>> I can view the currently running jobs in the Spark web UI?
>>
>> Thanks,
>> Divya
>>
>
--
Best Regards
Jeff Zhang
be
> defined in data frame for each table rather than importing the whole
> columns.
>
>
>
> Thanks,
>
>
>
>
>
> Mich Talebzadeh
>
>
>
> LinkedIn *
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>
>
>
--
Best Regards
Jeff Zhang
gainst-LocalRelation-tp26093.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
Best Regards
Jeff Zhang
ises
>>>
>>> _http://www.inside-r.org/packages/cran/arules/docs/discretize
>>>
>>> R code for example :
>>>
>>> ### equal frequency
>>> table(discretize(data$some_column, "frequency", categories=10))
>>>
>>>
>>> #k-means
>>> table(discretize(data$some_column, "cluster", categories=10))
>>>
>>> Thanks a lot !
>>>
>>
>>
>>
>> --
>> Joshua Taylor, http://www.cs.rpi.edu/~tayloj/
>>
>
>
--
Best Regards
Jeff Zhang
LTask: org.apache.hadoop.hive.ql.metadata.HiveException:
> MetaException(message:file:/user/hive/warehouse/src is not a directory or
> unable to create one)
>
> How to use HDFS instead of local file system(file)?
> Which parameter should I set?
>
> Thanks a lot.
>
>
> Peter Zhang
> --
> Google
> Sent with Airmail
>
--
Best Regards
Jeff Zhang
rk.deploy.SparkSubmitDriverBootstrapper
>>
>> If I replace deploy-mode to cluster the job is submitted successfully.
>> Is there a dependency missing from my project? Right now only one I
>> included is spark-streaming 1.6.0.
>>
>
>
--
Best Regards
Jeff Zhang