Hi,
I am using Spark k-means for clustering records that consist of news
documents; the vectors are created by applying TF-IDF. The dataset I am
using for testing right now is the ground-truth classified
http://qwone.com/~jason/20Newsgroups/
The issue is that all the documents are getting assigned to the same
cluster.
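(For reference, a minimal sketch of the kind of pipeline described above; the column names, variable names, and k are assumptions, not from the original message:)

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// "docs" is assumed to be a DataFrame with a "text" column.
val tokenized = new Tokenizer().setInputCol("text").setOutputCol("words").transform(docs)
val hashed = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").transform(tokenized)
val features = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(hashed).transform(hashed)

// 20 Newsgroups has 20 classes, so k = 20; "prediction" holds the cluster id.
val model = new KMeans().setK(20).setSeed(1L).fit(features)
val clustered = model.transform(features)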
Hi Michael,
Thanks much for the suggestion.
I was wondering - what's the best way to deserialize the 'value' field
On Fri, Mar 24, 2017 at 11:47 AM, Michael Armbrust
wrote:
> Encoders can only map data into an object if those columns already exist.
> When we are
I missed the part about passing in a schema to force the "struct" to a Map,
then using explode. Good option.
Yong
From: Michael Armbrust
Sent: Friday, March 24, 2017 3:02 PM
To: Yong Zhang
Cc: Selvam Raman; user
Subject: Re: how to read object
Thanks for the reply. Yeah, I found the same doc and am able to use multiple
cores in spark-shell; however, when I use pyspark, it appears to use only
one core. I am wondering if this is something I didn't configure correctly
or something not supported in pyspark.
On Fri, Mar 24, 2017 at 3:52 PM,
In local mode, all processes are executed inside a single JVM. An
application is started in local mode by setting master to local, local[*], or
local[n].
spark.executor.cores is not applicable in local mode because there is only
one embedded executor.
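(For example, a minimal sketch of requesting four cores in local mode; the app name is arbitrary:)

import org.apache.spark.sql.SparkSession

// local[4] runs four worker threads inside a single JVM;
// local[*] would use all available cores.
val spark = SparkSession.builder()
  .appName("local-cores-example")
  .master("local[4]")
  .getOrCreate()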
In Standalone
Hi,
I am wondering: does pyspark standalone (local) mode support multiple
cores/executors?
Thanks,
Li
Encoders can only map data into an object if those columns already exist.
When we are reading from Kafka, we just get a binary blob and you'll need
to help Spark parse that first. Assuming your data is stored in JSON, it
should be pretty straightforward.
val streams = spark
  .readStream
  .format("kafka")
  // ... (the rest of the snippet is elided in the original message)
I'm not sure you can parse this as an Array, but you can hint to the parser
that you would like to treat source as a map instead of as a struct. This
is a good strategy when you have dynamic columns in your data.
Here is an example of the schema you can use to parse this JSON and also
how to use
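(The example itself is cut off in the archive; a minimal sketch of the idea, with hypothetical field names, might look like this:)

import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types.{MapType, StringType, StructType}
import spark.implicits._  // for the $ column syntax

// Treat the dynamic "source" object as a map of string keys to string
// values instead of a fixed struct, then explode it into key/value rows.
val schema = new StructType().add("source", MapType(StringType, StringType))
val parsed = streams
  .select(from_json($"value".cast("string"), schema).as("json"))
  .select(explode($"json.source"))  // yields "key" and "value" columns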
Hi,
Currently encountering the following exception while working with the
below-mentioned code snippet:
> Please suggest the correct approach for reading the stream into a SQL
> schema.
> If I add 'tweetSchema' while reading the stream, it errors out with the
> message - we can not change static schema
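(For what it's worth, the Kafka source has a fixed schema of key, value, topic, and so on, so the usual workaround is to apply the user-defined schema after reading rather than via .schema() on readStream. A sketch, assuming kafkaStream and tweetSchema come from the elided snippet:)

import org.apache.spark.sql.functions.from_json

// Parse the Kafka "value" column with tweetSchema after reading,
// instead of passing the schema to readStream itself.
val tweets = kafkaStream
  .select(from_json($"value".cast("string"), tweetSchema).as("tweet"))
  .select("tweet.*")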
Hi all,
I am trying to trap the UI kill event of a Spark application from the
driver. Somehow the exception thrown is not propagated to the driver main
program. See for example using spark-shell below.
Is there a way to get hold of this event and shutdown the driver program?
Regards,
Noorul
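(Not a confirmed answer from this thread, but one sketch worth trying is to register a SparkListener and react when the application ends:)

import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// Fires when the application ends, including after a kill from the UI.
sc.addSparkListener(new SparkListener {
  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
    // shut down / clean up the driver program here
  }
})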
Not sure if anyone else here can help you. But if I were you, I would adjust
SPARK_DAEMON_MEMORY to 2g, to bump the worker to 2G. Even though the worker's
responsibility is very limited, in today's world, who knows. Give 2g a try
to see if the problem goes away.
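(For instance, in conf/spark-env.sh on the worker machines - a sketch:)

export SPARK_DAEMON_MEMORY=2g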
BTW, in our production, I
Yeah, we also didn't find anything related to this online.
Are you aware of any memory leaks in worker in 1.6.2 spark which might be
causing this ?
Do you know of any documentation which explains all the tasks that a worker
is performing ? Maybe we can get some clue from there.
Regards,
Behroz
On
I have never experienced a worker OOM, and I very rarely see this reported
online. So my guess is that you will have to generate a heap dump file and
analyze it.
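(One way to get that dump automatically is via the daemon JVM options in conf/spark-env.sh - the dump path here is an assumption:)

# Write a heap dump when the worker JVM hits an OutOfMemoryError.
export SPARK_DAEMON_JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"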
Yong
From: Behroz Sikander
Sent: Friday, March 24, 2017 9:15 AM
To: Yong Zhang
Cc: user@spark.apache.org
Of course it is possible.
You can always set any configuration in your application using the API,
instead of passing it in through the CLI.
val sparkConf = new SparkConf()
  .setAppName(properties.get("appName"))
  .setMaster(properties.get("master"))
  .set("xxx", properties.get("xxx"))  // "xxx" is a placeholder key
Your error is
Thank you for the response.
Yes, I am sure because the driver was working fine. Only 2 workers went
down with OOM.
Regards,
Behroz
On Fri, Mar 24, 2017 at 2:12 PM, Yong Zhang wrote:
> I am not 100% sure, but normally a "dispatcher-event-loop" OOM means the
> driver OOMed.
I am not 100% sure, but normally a "dispatcher-event-loop" OOM means the
driver OOMed. Are you sure it was your workers that OOMed?
Yong
From: bsikander
Sent: Friday, March 24, 2017 5:48 AM
To: user@spark.apache.org
Subject: [Worker Crashing]
Hi,
I am trying to deploy a Spark job by using spark-submit, which has a bunch
of parameters like
spark-submit --class StreamingEventWriterDriver --master yarn --deploy-mode
cluster --executor-memory 3072m --executor-cores 4 --files streaming.conf
spark_streaming_2.11-assembly-1.0-SNAPSHOT.jar -conf
Also noticed that there are 8 "dispatcher-event-loop-0" through "-7" threads
and 8 "map-output-dispatcher-0" through "-7" threads, all waiting at the
same location in the code, namely:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
Hi All,
My Spark job hangs here... Looking into the thread dump, I noticed that it
hangs (stack trace given below) on the count action on a DataFrame (given
below). The data is very small - actually not more than even 10 rows.
I noticed some JIRAs about this issue, but all are resolved-closed in
Spark version: 1.6.2
Hadoop: 2.6.0
Cluster:
All VMS are deployed on AWS.
1 Master (t2.large)
1 Secondary Master (t2.large)
5 Workers (m4.xlarge)
Zookeeper (t2.large)
Recently, 2 of our workers went down with an out-of-memory exception:
java.lang.OutOfMemoryError: GC overhead limit exceeded (max
Hi,
I wanted to skip the headers of all the CSVs present in a directory.
After searching on Google, I learned that it can be done using
sc.wholeTextFiles.
Can anyone suggest how to do that in Scala?
Thanks & Regards,
Nayan Sharma
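(A minimal sketch of that approach, assuming each file's first line is its header; the path is hypothetical:)

// Read each CSV whole as a (path, content) pair, then drop the first line.
val withoutHeaders = sc
  .wholeTextFiles("/path/to/csv/dir")
  .flatMap { case (_, content) => content.split("\n").drop(1) }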
Maybe a UDF to flatten is an interesting option as well.
http://stackoverflow.com/q/42888711/2587904 Would a UDAF be more
performant?
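(For context, a minimal sketch of the UDF idea - the DataFrame, column name, and element type are assumptions:)

import org.apache.spark.sql.functions.udf

// Flatten a column of nested arrays into a single array per row.
val flattenUdf = udf((xs: Seq[Seq[String]]) => xs.flatten)
val flat = df.withColumn("flat", flattenUdf($"nested"))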
shyla deshpande wrote on Fri, Mar 24, 2017 at 04:04:
> Thanks a million Yong. Great help!!! It solved my problem.
>
> On Thu, Mar