We are happy to announce the availability of Spark 2.1.1!
Apache Spark 2.1.1 is a maintenance release, based on the branch-2.1
maintenance branch of Spark. We strongly recommend that all 2.1.x users
upgrade to this stable release.
To download Apache Spark 2.1.1 visit http://spark.apache.org/downloa
Sorry, I had a typo, I meant repartition($"fieldofjoin").
On May 2, 2017, 9:44 PM, "KhajaAsmath Mohammed"
wrote:
Hi Angel,
I am trying the code below, but I don't see the partition on the DataFrame.
val iftaGPSLocation_df = sqlContext.sql(iftaGPSLocQry)
import sqlContext._
import sqlContext.implicits._
datapoint_prq_df.join(geoCacheLoc_df)
val tableA = dfA.repartition($"joinField").filter("firstSegment")
Unfortunately there is not an easy way to add nested columns (though I do
think we should implement the API you attempted to use).
You'll have to build the struct manually.
allData.withColumn("student", struct($"student.name",
  coalesce($"student.age", lit(0)) as 'age))
You could automate the cons
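For reference, here is a self-contained sketch of the same idea (the schema
is my guess at the one being discussed, not the actual data):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, lit, struct}

val spark = SparkSession.builder.appName("nested-struct").getOrCreate()
import spark.implicits._

// Hypothetical schema: a nested student struct with a nullable age.
case class Student(name: String, age: Option[Int])
case class Record(id: Int, student: Student)

val allData = Seq(
  Record(1, Student("alice", Some(12))),
  Record(2, Student("bob", None))).toDF()

// Rebuild the struct field by field, defaulting a null age to 0.
val fixed = allData.withColumn("student",
  struct($"student.name" as "name", coalesce($"student.age", lit(0)) as "age"))
fixed.show()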
One, I think, you should take this to the spark developer list.
Two, I suspect broadcast variables aren't the best solution for the use
case you describe. Maybe an in-memory data/object/file store like Tachyon
is a better fit.
Thanks,
Tim
On Tue, May 2, 2017 at 11:56 AM, Nipun Arora
wrote:
Hi All,
To support our Spark Streaming based anomaly detection tool, we have made a
patch to Spark 1.6.2 to dynamically update broadcast variables.
I'll first explain our use-case, which I believe should be common to
several people using Spark Streaming applications. Broadcast variables are
often
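(For readers hitting the same problem before any patch lands, here is a
minimal sketch of the common driver-side workaround; all names below are
mine, not the patch from this thread. Given a DStream `stream`, foreachRDD
runs on the driver each batch, so the closure always captures the latest
broadcast.)

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object BroadcastHolder {
  @volatile private var bc: Broadcast[Map[String, Double]] = _

  // Driver-side: swap in a fresh broadcast and drop the old copies.
  def update(sc: SparkContext, data: Map[String, Double]): Unit = {
    if (bc != null) bc.unpersist(blocking = false)
    bc = sc.broadcast(data)
  }

  def get: Broadcast[Map[String, Double]] = bc
}

stream.foreachRDD { rdd =>
  BroadcastHolder.update(rdd.sparkContext, loadLatestState()) // hypothetical loader
  val state = BroadcastHolder.get
  rdd.foreach(r => process(r, state.value)) // hypothetical per-record function
}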
You don't need write ahead logs for a direct stream.
On Tue, May 2, 2017 at 11:32 AM, kant kodali wrote:
> Hi All,
>
> I need some fault tolerance for my stateful computations and I am wondering
> why we need to enable writeAheadLogs for a DirectStream like Kafka (for an
> indirect stream it makes sense). In case of driver failure, a DirectStream
> such as Kafka can pull the messages again from the last committed offset.
Yes, I noticed these open issues, both with KMeans and GMM:
https://issues.apache.org/jira/browse/SPARK-13025
Thanks,
Tim
On Mon, May 1, 2017 at 9:01 PM, Yanbo Liang wrote:
> Hi Tim,
>
> Spark ML API doesn't currently support setting an initial model for GMM. I wish
> we could get this feature in Spar
Seems like
https://issues.apache.org/jira/browse/SPARK-13346
is likely the same issue: for some people persist() doesn't work and they
have to convert to RDDs and back.
On Fri, Apr 14, 2017 at 1:39 PM, Everett Anderson wrote:
> Hi,
>
> We keep hitting a situation on Spark 2.0.2 (h
Hi Michael,
Thank you for the suggestions. I am wondering how I can make `withColumn`
handle nested structures?
For example, below is my code to generate the data. I basically add the
`age` field to `Person2`, which is nested in an Array for Course2. Then I
want to fill in 0 for age when age is
Have you tried partitioning by the join field and running the join in
segments, filtering both tables to the same segment of data?
Example:
val tableA = dfA.repartition($"joinField").filter("firstSegment")
val tableB = dfB.repartition($"joinField").filter("firstSegment")
tableA.join(tableB, "joinField")
On May 2
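(Spelling the suggestion out a bit; the segment column and values below are
hypothetical. The idea is to run the join once per segment and union the
partial results:)

import org.apache.spark.sql.functions.col

val segments = Seq("seg1", "seg2") // hypothetical segment values
val joined = segments.map { seg =>
  dfA.filter(col("segment") === seg)
    .repartition(col("joinField"))
    .join(dfB.filter(col("segment") === seg), "joinField")
}.reduce(_ union _)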
Table 1 (192 GB) is partitioned by year and month; the 192 GB of data is for
one month, i.e. April.
Table 2: 92 GB, not partitioned.
I have to perform join on these tables now.
On Tue, May 2, 2017 at 1:27 PM, Angel Francisco Orta <
angel.francisco.o...@gmail.com> wrote:
> Hello,
>
> Are the tables partitioned?
Hello,
Are the tables partitioned?
If yes, what is the partition field?
Thanks
On May 2, 2017, 8:22 PM, "KhajaAsmath Mohammed"
wrote:
Hi,
I am trying to join two big tables in Spark and the job has been running for
quite a long time without any results.
Table 1: 192 GB
Table 2: 92 GB
Does anyone have a better solution to get the results faster?
Hi,
I am trying to join two big tables in Spark and the job has been running for
quite a long time without any results.
Table 1: 192 GB
Table 2: 92 GB
Does anyone have a better solution to get the results faster?
Thanks,
Asmath
Hi All,
I need some fault tolerance for my stateful computations and I am wondering
why we need to enable writeAheadLogs for a DirectStream like Kafka (for an
indirect stream it makes sense). In case of driver failure, a DirectStream
such as Kafka can pull the messages again from the last committed offset.
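(A minimal sketch of a direct stream that relies on checkpointing rather
than a WAL, using the spark-streaming-kafka-0-10 API; the topic, paths, and
Kafka settings below are made up:)

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("direct-stream-no-wal")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs:///tmp/checkpoints") // hypothetical path

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "localhost:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "example-group",
    "auto.offset.reset" -> "latest")

  // The direct stream tracks offsets itself; after a driver restart it
  // resumes from the checkpointed offsets, so no receiver WAL is involved.
  val stream = KafkaUtils.createDirectStream[String, String](
    ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))
  stream.map(r => (r.key, r.value)).print()
  ssc
}

val ssc = StreamingContext.getOrCreate("hdfs:///tmp/checkpoints", createContext _)
ssc.start()
ssc.awaitTermination()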
I see. Thanks!
On Tue, May 2, 2017 at 9:12 AM, Marcelo Vanzin wrote:
> On Tue, May 2, 2017 at 9:07 AM, Nan Zhu wrote:
> > I have no easy way to pass the jar path to those forked Spark
> > applications? (except that I download the jar from a remote path to a local
> > temp dir after resolving some
On Tue, May 2, 2017 at 9:07 AM, Nan Zhu wrote:
> I have no easy way to pass the jar path to those forked Spark
> applications? (except that I download the jar from a remote path to a local
> temp dir after resolving some permission issues, etc.?)
Yes, that's the only way currently in client mode.
--
Marcelo
Thanks for the reply! If I have an application master which starts some
Spark applications by forking processes (in yarn-client mode), essentially I
have no easy way to pass the jar path to those forked Spark applications?
(except that I download the jar from a remote path to a local temp dir after
resolvi
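(A minimal sketch of that workaround using the Hadoop FileSystem API; the
paths below are hypothetical:)

import java.net.URI
import java.nio.file.Files
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new URI("hdfs:///"), new Configuration())
val localDir = Files.createTempDirectory("spark-jars")
val localJar = new Path(localDir.resolve("app.jar").toString)

// Copy the jar out of HDFS so the forked client-mode spark-submit can see it.
fs.copyToLocalFile(new Path("hdfs:///libs/app.jar"), localJar)
// localJar can now be passed to the forked application via --jars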
Remote jars are added to executors' classpaths, but not the driver's.
In YARN cluster mode, they would also be added to the driver's class
path.
On Tue, May 2, 2017 at 8:43 AM, Nan Zhu wrote:
> Hi, all
>
> For some reason, I tried to pass in an HDFS path to the --jars option in
> spark-submit
>
>
Hi, all
For some reason, I tried to pass in an HDFS path to the --jars option in
spark-submit
According to the documentation,
http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management,
--jars should accept a remote path.
However, in the implementation,
https://github.
Hi Spark Users,
I have a dataset with ~5M rows x 20 columns, containing a groupID and a
rowID. My goal is to check whether (some) columns contain more than a
fixed fraction (say, 50%) of missing (null) values within a group. If
this is found, the entire column is set to missing (null), for tha
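(One way to sketch this with window functions; the groupID name and 50%
threshold are from the description above, everything else is assumed:)

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col, lit, when}

val threshold = 0.5
val byGroup = Window.partitionBy(col("groupID"))

// For each listed column, compute its per-group null fraction and null the
// column out in groups where that fraction exceeds the threshold.
def nullOutSparseCols(df: DataFrame, cols: Seq[String]): DataFrame =
  cols.foldLeft(df) { (d, c) =>
    val nullFrac = avg(when(col(c).isNull, 1.0).otherwise(0.0)).over(byGroup)
    d.withColumn(c, when(nullFrac > threshold, lit(null)).otherwise(col(c)))
  }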