Hello Jayant,
Thank you so much for the suggestion. My idea was to use a Python function as
a transformation that takes a couple of column names and returns an object,
which is what you explained. Would it be possible to point me to a similar
codebase example?
Thanks.
On Fri, Jul 6, 2018 at 2:56 AM, Jayant
Hi
We are migrating our Direct Streaming Spark job to Structured Streaming. We
have a batch size of 1 minute.
I am consistently seeing that the Structured Streaming job is always 3-5
minutes behind the Direct Streaming job. Is there some kind of fine tuning
that will help Structured Streaming?
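For reference, a minimal sketch of the knobs such fine tuning usually touches; the Kafka source, option values, and config names below are assumptions for illustration, not details from this thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder
  .appName("structured-streaming-tuning")
  // fewer shuffle partitions than the default 200 often helps small micro-batches
  .config("spark.sql.shuffle.partitions", "40")
  .getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  // cap the records pulled per micro-batch so a backlog cannot stretch batch times
  .option("maxOffsetsPerTrigger", "100000")
  .load()

events.writeStream
  .format("console")
  // align the trigger with the 1-minute batch size used in the Direct Streaming job
  .trigger(Trigger.ProcessingTime("1 minute"))
  .option("checkpointLocation", "/tmp/checkpoints/tuning-demo")
  .start()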
Hi
I am trying to explore how I can use a UDAF for my use case.
I have something like this in my Structured Streaming job:
val counts: Dataset[(String, Double)] = events
  .withWatermark("timestamp", "30 minutes")
  .groupByKey(e => e._2.siteIdentifier + "|" +
thanks
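For what it's worth, a minimal sketch of how a typed Aggregator (the Dataset flavour of a UDAF) can be plugged into a groupByKey pipeline like the one above; the SiteEvent fields and the average-style metric are assumptions, not taken from the job:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class SiteEvent(siteIdentifier: String, value: Double)

// running (sum, count) buffer that yields an average per key
object AvgValue extends Aggregator[SiteEvent, (Double, Long), Double] {
  def zero: (Double, Long) = (0.0, 0L)
  def reduce(b: (Double, Long), e: SiteEvent): (Double, Long) = (b._1 + e.value, b._2 + 1)
  def merge(x: (Double, Long), y: (Double, Long)): (Double, Long) = (x._1 + y._1, x._2 + y._2)
  def finish(b: (Double, Long)): Double = if (b._2 == 0L) 0.0 else b._1 / b._2
  def bufferEncoder: Encoder[(Double, Long)] = Encoders.tuple(Encoders.scalaDouble, Encoders.scalaLong)
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// with events: Dataset[SiteEvent], the aggregation would then look like:
//   val counts: Dataset[(String, Double)] = events
//     .groupByKey(_.siteIdentifier)
//     .agg(AvgValue.toColumn)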
Only the stream metadata (e.g., stream id, offsets) is stored as JSON. The
stream state data is stored in an internal binary format.
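As a minimal sketch (the checkpoint path here is a placeholder): the offset and commit logs under the checkpoint directory are plain text/JSON and can be read directly, while everything under state/ stays in that internal binary format.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("inspect-checkpoint").getOrCreate()
spark.read.text("/path/to/checkpoint/offsets").show(false)
spark.read.text("/path/to/checkpoint/commits").show(false)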
On Mon, Jul 9, 2018 at 4:07 PM, subramgr
wrote:
> Hi,
>
> I read somewhere that with Structured Streaming all the checkpoint data is
> more readable (JSON-like).
Hi,
I read somewhere that with Structured Streaming all the checkpoint data is
more readable (JSON-like). Is there any documentation on how to read the
checkpoint data?
If I do `hadoop fs -ls` on the `state` directory, I get some encoded data.
Thanks
Girish
It's still under design review. It's unlikely that it will go into 2.4.
On Mon, Jul 9, 2018 at 3:46 PM trung kien wrote:
> Thanks Li,
>
> I read through the ticket; being able to pass a pod YAML file would be amazing.
>
> Do you have any target date for production or incubator? I really want to
>
Thanks Li,
I read through the ticket; being able to pass a pod YAML file would be amazing.
Do you have any target date for production or incubator? I really want to
try out this feature.
On Mon, Jul 9, 2018 at 4:48 PM Yinan Li wrote:
> Spark on k8s currently doesn't support specifying a custom
Folks,
I am writing some Scala/Java code and want it to be usable from pyspark.
For example:
class MyStuff(addend: Int) {
  def myMapFunction(x: Int) = x + addend
}
I want to call it from pyspark as:
df = ...
mystuff = sc._jvm.MyStuff(5)
df['x'].map(lambda x: mystuff.myMapFunction(x))
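One option worth sketching (names are hypothetical, and this swaps the per-row JVM call for a registered UDF): register the Scala logic as a SQL function on the shared SparkSession, which PySpark can then invoke by name via expr("myMapFunction(x)") or selectExpr.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("udf-register-demo").getOrCreate()

class MyStuff(addend: Int) extends Serializable {
  def myMapFunction(x: Int): Int = x + addend
}

val stuff = new MyStuff(5)
// registered by name, so any language binding on the same session can call it
spark.udf.register("myMapFunction", (x: Int) => stuff.myMapFunction(x))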
Spark on k8s currently doesn't support specifying a custom SecurityContext
of the driver/executor pods. This will be supported by the solution to
https://issues.apache.org/jira/browse/SPARK-24434.
On Mon, Jul 9, 2018 at 2:06 PM trung kien wrote:
> Dear all,
>
> Is there any way to includes
Dear all,
Is there any way to include a security context (
https://kubernetes.io/docs/tasks/configure-pod-container/security-context/)
when submitting a job through the k8s API servers?
I'm trying to run my first Spark jobs on Kubernetes through spark-submit:
bin/spark-submit --master k8s://https://API_SERVERS
That usually happens when you have different types for a column across some
Parquet files.
In this case, I think you have a column of `Long` type that got a file with
`Integer` type; I had to deal with a similar problem once.
You would have to cast it yourself to Long.
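A minimal sketch of that cast (the path and column name are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("cast-to-long-demo").getOrCreate()
// read the Parquet data that mixes Integer- and Long-typed files
val df = spark.read.parquet("/data/events")
// force the offending column to Long so every file agrees before the join/show
val fixed = df.withColumn("someId", col("someId").cast("long"))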
On Mon, Jul 9, 2018 at 2:53 PM
Try doing `unpersist(blocking=true)`
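A minimal sketch of the difference (the toy DataFrame is just for illustration): the blocking variant waits until the cached blocks are actually dropped before returning, instead of releasing them asynchronously.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("unpersist-demo").getOrCreate()
val df = spark.range(0, 1000000).toDF("id")

df.persist()
df.count()                    // materialise the cache; its size shows up in the Storage tab
df.unpersist(blocking = true) // block until the cached blocks are actually removed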
On Mon, Jul 9, 2018 at 2:59 PM Jeffrey Charles
wrote:
>
> I'm persisting a dataframe in Zeppelin which has dynamic allocation enabled
> to get a sense of how much memory the dataframe takes up. After I note the
> size, I unpersist the dataframe. For some
I'm persisting a dataframe in Zeppelin (which has dynamic allocation enabled)
to get a sense of how much memory the dataframe takes up. After I note the
size, I unpersist the dataframe. For some reason, YARN is not releasing the
executors that were added to Zeppelin. If I don't run the persist and
I am getting the following error after performing joins between 2 dataframes. It
happens on the call to the .show() method. I assume it's an issue with an incompatible
type, but it's been really hard to identify which column of which dataframe
has that incompatibility.
Any pointers?
11:06:10.304 13700 [Executor
Hello list,
We are running Apache Spark on a Mesos cluster and we are seeing weird behavior from the
executors. When we submit an app with e.g. 10 cores and 2GB of memory per executor and max
cores 30, we expect to see 3 executors running on the cluster. However,
sometimes there are only 2... Spark applications are
Spark provides a nice REST API to get metrics:
https://spark.apache.org/docs/latest/monitoring.html#rest-api
The problem is that this API is keyed by application id, which can change if
we are running in supervise mode.
Any application that is built on top of the REST API has to deal with the
changing
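A minimal sketch of what that looks like from a consumer's side (host, port, and application id are placeholders): the id has to be re-resolved from /api/v1/applications after every driver restart before the per-application endpoints can be queried.

// list the applications known to this UI/history server and pick out the current id
val appsJson = scala.io.Source
  .fromURL("http://driver-host:4040/api/v1/applications")
  .mkString
println(appsJson)
// then query e.g. http://driver-host:4040/api/v1/applications/<app-id>/jobs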
Greetings, Apache software enthusiasts!
(You’re getting this because you’re on one or more dev@ or users@ lists
for some Apache Software Foundation project.)
ApacheCon North America, in Montreal, is now just 80 days away, and
early bird prices end in just two weeks - on July 21. Prices will
@yohann It looks like something is wrong with my environment that I have yet to
figure out, but the theory so far makes sense. I had also tried it in
another environment with a very minimal configuration like mine, and it
works fine there, so clearly something is wrong with my env; I don't know
why.
Anyone?
Hi Swetha,
I also had the same requirement: reading JSON from Kafka and writing it
back in Parquet format.
I did a workaround (a rough sketch follows below):
1. Inferred the schema using the batch API by reading the first few rows.
2. Started streaming using the schema inferred in step 1.
*Limitation*: Will not work if you
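A minimal sketch of those two steps; the sample path, Kafka options, and output locations are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}

val spark = SparkSession.builder.appName("infer-then-stream").getOrCreate()

// 1. infer the schema with the batch API from a small sample of the JSON records
val inferredSchema = spark.read.json("/data/sample-of-topic.json").schema

// 2. start the stream using the schema inferred in step 1
val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .select(from_json(col("value").cast("string"), inferredSchema).as("event"))

parsed.writeStream
  .format("parquet")
  .option("path", "/data/out")
  .option("checkpointLocation", "/data/out/_checkpoints")
  .start()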
Thanks Amiya/TD for responding.
@TD,
Thanks for letting us know about this new foreachBatch API; this handle on the
per-batch dataframe should be useful in many cases.
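A minimal sketch of that handle (Spark 2.4+, with hypothetical sinks and paths): each micro-batch arrives as a DataFrame, so it can be cached once and written to several sinks without going back to the source.

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("foreachbatch-demo").getOrCreate()
val streamingDF = spark.readStream.format("rate").load() // stand-in streaming source

streamingDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.persist()
    batchDF.write.mode("append").parquet("/tmp/sink-parquet")
    batchDF.write.mode("append").json("/tmp/sink-json")
    batchDF.unpersist()
  }
  .option("checkpointLocation", "/tmp/checkpoints/dual-sink")
  .start()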
@Amiya,
The input source will be read twice and the entire DAG computation will be done
twice. Not a limitation, but resource utilisation and