Congrats, great work Dongjoon.
On Tuesday, January 15, 2019 at 3:47 PM, Dongjoon Hyun wrote:
> We are happy to announce the availability of Spark 2.2.3!
>
> Apache Spark 2.2.3 is a maintenance release, based on the branch-2.2
> maintenance branch of Spark. We strongly recommend all 2.2.x users
We are happy to announce the availability of Spark 2.2.3!
Apache Spark 2.2.3 is a maintenance release, based on the branch-2.2
maintenance branch of Spark. We strongly recommend all 2.2.x users to
upgrade to this stable release.
To download Spark 2.2.3, head over to the download page:
http://spark.apache.org/downloads.html
Hey Guys,
Just launched a monthly Apache Spark Newsletter.
https://newsletterspot.com/apache-spark/
Cheers,
Ankur
Severity: Low
Vendor: The Apache Software Foundation
Versions Affected:
All versions of Apache Spark
Description:
Spark's standalone resource manager accepts code to execute on a 'master' host,
which then runs that code on 'worker' hosts. The master itself does not, by
design, execute user code
My previous answers to this question can be found in the archives, along
with some other responses:
http://apache-spark-user-list.1001560.n3.nabble.com/testing-frameworks-td32251.html
https://www.mail-archive.com/user%40spark.apache.org/msg48032.html
I have made a couple of presentations
Hard to answer in a succinct manner but I'll give it a shot.
Cucumber is a tool for writing *Behaviour* Driven Tests (closely related to
behaviour driven development, BDD).
It is not a mere *technical* approach to testing but a mindset, a way of
working and a different (different, whether it is
Sparklens from Qubole is a good resource. Other tests need to be handled by the
developer.
Best,
Ravi
On Thu, Nov 15, 2018, 12:45 PM, wrote:

> Hi all,
>
> How are you testing your Spark applications?
>
> We are writing features by using Cucumber. This is testing the behaviours.
> Is this called functional
Hi all,
How are you testing your Spark applications?
We are writing features by using Cucumber. This is testing the behaviours. Is
this called functional test or integration test?
We are also planning to write unit tests.
For instance we have a class like below. It has one method. This method
size ?
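To make the unit-test part concrete, here is a minimal sketch only (the WordCounter class and its countWords method are hypothetical stand-ins for a class like the one described above, and ScalaTest is assumed): testing a single Spark transformation with a local SparkSession.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.scalatest.FunSuite

object WordCounter {
  // the single method under test: count occurrences per word
  def countWords(df: DataFrame): DataFrame = df.groupBy("word").count()
}

class WordCounterSuite extends FunSuite {
  private val spark = SparkSession.builder().master("local[2]").appName("unit-test").getOrCreate()
  import spark.implicits._

  test("countWords counts occurrences per word") {
    val input = Seq("a", "b", "a").toDF("word")
    val result = WordCounter.countWords(input)
      .collect()
      .map(r => r.getString(0) -> r.getLong(1))
      .toMap
    assert(result == Map("a" -> 2L, "b" -> 1L))
  }
}

Cucumber features over the whole pipeline are closer to functional or integration tests, while a test like this exercises one method in isolation.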
On Thu, Nov 8, 2018 at 2:18 PM Marcelo Vanzin
wrote:
> +user@
>
> >> -- Forwarded message -
> >> From: Wenchen Fan
> >> Date: Thu, Nov 8, 2018 at 10:55 PM
> >> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
>
...the release manager,
>>>>> Wenchen!
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
+ user list
On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan wrote:
> resend
>
> On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan wrote:
+user@
>> -- Forwarded message -
>> From: Wenchen Fan
>> Date: Thu, Nov 8, 2018 at 10:55 PM
>> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
>> To: Spark dev list
>>
>>
>> Hi all,
>>
>> Apache Spark 2.4.0 is the
...stage 0.0 (TID 40)
shdown": "true
>
> test table in Hive is pointing to hdfs://test/ and partitioned on date
>
> val sqlStr = s"select * from test where date > 20181001"
> val logs = spark.sql(sqlStr)
>
> With Hive query I don't see filter pushdown is happening. I tried sett
...val logs = spark.sql(sqlStr)
With Hive query I don't see filter pushdown is happening. I tried setting
these configs in both hive-site.xml and also spark.sqlContext.setConf
"hive.optimize.ppd":"true",
"hive.optimize.ppd.storage":"true"
orc files I tried running Hive query on
> same dataset. But I was not able to push the filter predicate. Where should I
> set the below configs: "hive.optimize.ppd":"true",
> "hive.optimize.ppd.storage":"true"
>
> Sug
But I was not able to push the filter predicate. Where should I
set the below configs: "hive.optimize.ppd":"true",
"hive.optimize.ppd.storage":"true"
Suggest what is the best way to read orc files from
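For reference, a minimal sketch (not from the thread) of the Spark-side settings that usually govern ORC predicate pushdown; whether the Hive-side hive.optimize.ppd settings apply at all depends on which reader actually performs the scan. The table name follows the example above.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-pushdown-sketch")
  .config("spark.sql.orc.filterPushdown", "true")        // push filters down into the ORC reader
  .config("spark.sql.hive.convertMetastoreOrc", "true")  // read Hive ORC tables with Spark's own reader
  .enableHiveSupport()
  .getOrCreate()

val logs = spark.sql("select * from test where date > 20181001")
logs.explain(true)   // check the physical plan for PushedFilters: [...]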
...karan alang wrote:
>> Hello
>> - is there a "performance" difference when using Java or Scala for Apache
>> Spark ?
>>
>> I understand, there are other obvious differences (less code with scala,
>> easier to focus on logic etc),
>> but wrt performance - i think
How about Python?
Java vs Scala vs Python vs R, which is better?
On Sat, Oct 27, 2018 at 3:34 AM karan alang wrote:
> Hello
> - is there a "performance" difference when using Java or Scala for Apache
> Spark ?
>
> I understand, there are other obvious differences (less
>> Scala. Now if you want to do data science, Java is probably not the best
>> tool yet...
>>
>> On Oct 26, 2018, at 6:04 PM, karan alang wrote:
>>
>> Hello
>> - is there a "performance" difference when using Java or Scala for Apache
>> Spark ?
of your team to
> Scala. Now if you want to do data science, Java is probably not the best
> tool yet...
>
> On Oct 26, 2018, at 6:04 PM, karan alang wrote:
>
> Hello
> - is there a "performance" difference when using Java or Scala for Apache
> Spark ?
>
> I unders
...
> On Oct 26, 2018, at 6:04 PM, karan alang wrote:
>
> Hello
> - is there a "performance" difference when using Java or Scala for Apache
> Spark ?
>
> I understand, there are other obvious differences (less code with scala,
> easier to focus on logic etc)
On Oct 27, 2018 3:34 AM, "karan alang" wrote:
Hello
- is there a "performance" difference when using Java or Scala for Apache
Spark ?
I understand, there are other obvious differences (less code with scala,
easier to focus on logic etc),
but wrt performance - i think the
Hello
- is there a "performance" difference when using Java or Scala for Apache
Spark ?
I understand, there are other obvious differences (less code with Scala,
easier to focus on logic etc.),
but wrt performance, I think there would not be much of a difference since
both of them are
Severity: Low
Vendor: The Apache Software Foundation
Versions Affected:
1.3.x release branch and later, including master
Description:
Spark's Apache Maven-based build includes a convenience script, 'build/mvn',
that downloads and runs a zinc server to speed up compilation. This server
will
> https://databricks.com/session/apache-spark-in-cloud-and-hybrid-why-security-and-governance-become-more-important
>
> It seems also interesting.
>
> I was in meeting, I will also watch it.
>
>
>
> From: Gourav Sengupta
> Date: 24 October 2018 Wednesday 13:39
Thank you Gourav,
Today I saw the article:
https://databricks.com/session/apache-spark-in-cloud-and-hybrid-why-security-and-governance-become-more-important
It seems also interesting.
I was in meeting, I will also watch it.
From: Gourav Sengupta
Date: 24 October 2018 Wednesday 13:39
To: *"Ozsakarya, Omer"
> *Cc: *Spark Forum
> *Subject: *Re: Triggering sql on Was S3 via Apache Spark
>
>
>
> This is interesting you asked and then answered the questions (almost) as
> well
>
>
>
> Regards,
>
> Gourav
>
>
>
> On Tue, 23 O
Thank you very much
From: Gourav Sengupta
Date: 24 October 2018 Wednesday 11:20
To: "Ozsakarya, Omer"
Cc: Spark Forum
Subject: Re: Triggering sql on AWS S3 via Apache Spark
This is interesting you asked and then answered the questions (almost) as well
Regards,
Gourav
On Tue, 2
This is interesting you asked and then answered the questions (almost) as
well
Regards,
Gourav
On Tue, 23 Oct 2018, 13:23 , wrote:
> Hi guys,
>
>
>
> We are using Apache Spark on a local machine.
>
>
>
> I need to implement the scenario below.
>
>
>
ver-on-prem-put-into-s3-bucket
>
> Thanks,
> Divya
>
>> On Tue, 23 Oct 2018 at 15:53, wrote:
>> Hi guys,
>>
>>
>>
>> We are using Apache Spark on a local machine.
>>
>>
>>
>> I need to implement the scenario b
-ftp-server-on-prem-put-into-s3-bucket
Thanks,
Divya
On Tue, 23 Oct 2018 at 15:53, wrote:
> Hi guys,
>
>
>
> We are using Apache Spark on a local machine.
>
>
>
> I need to implement the scenario below.
>
>
>
> In the initial load:
>
>1. CRM applic
Hi guys,
We are using Apache Spark on a local machine.
I need to implement the scenario below.
In the initial load:
1. CRM application will send a file to a folder. This file contains customer
information of all customers. This file is in a folder in the local server.
File name
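As a rough sketch of the initial load described above (paths, file names and the S3 bucket are made-up placeholders, and CSV with a header row is only an assumption):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("crm-initial-load").getOrCreate()

// Read the full customer extract dropped by the CRM into the local landing folder.
val customers = spark.read
  .option("header", "true")
  .csv("file:///data/crm/landing/customers_full.csv")

// Persist it somewhere queryable, e.g. S3 via the s3a connector.
customers.write
  .mode("overwrite")
  .parquet("s3a://example-bucket/crm/customers/")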
Hi,
Just a small plug for Triangle Apache Spark Meetup (TASM), which covers Raleigh,
Durham, and Chapel Hill in North Carolina, USA. The group started back in July
2015. More details here: https://www.meetup.com/Triangle-Apache-Spark-Meetup/
We are happy to announce the availability of Spark 2.3.2!
Apache Spark 2.3.2 is a maintenance release, based on the branch-2.3
maintenance branch of Spark. We strongly recommend all 2.3.x users to
upgrade to this stable release.
To download Spark 2.3.2, head over to the download page:
http://spark.apache.org/downloads.html
I have a docker based cluster. In my cluster, I try to schedule spark jobs
by using Airflow. Airflow and Spark are running separately in *different
containers*. However, I cannot run a spark job by using airflow.
Below is my Airflow script:
from airflow import DAG
from
Hello,
we are creating a new meetup of enthusiastic Apache Spark users
in Italy, in Padova
https://www.meetup.com/Padova-Apache-Spark-Meetup/
Is it possible to add the meetup link to the web page
https://spark.apache.org/community.html ?
Moreover is it possible to announce future
Severity: Medium
Vendor: The Apache Software Foundation
Versions Affected:
Spark versions from 1.3.0, running standalone master with REST API enabled,
or running Mesos master with cluster mode enabled
Description:
From version 1.3.0 onward, Spark's standalone master exposes a REST API for
job
Hello,
I have very general question about Apache Spark. I want to know if it is
possible (and where to start, if possible) to implement a data quality
measurement prototype for streaming data using Apache Spark. Let's say I
want to work on Timeliness or Completeness as a data quality metrics
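It is possible in principle. As one rough sketch (field names, the Kafka endpoint and the metric definition are all assumptions, not an established library), a completeness metric can be computed as a regular Structured Streaming aggregation:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("dq-sketch").getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS raw", "timestamp")

val parsed = events.select(
  get_json_object(col("raw"), "$.user_id").as("user_id"),
  col("timestamp"))

// Completeness: fraction of records per minute in which user_id is present.
val completeness = parsed
  .withWatermark("timestamp", "5 minutes")
  .groupBy(window(col("timestamp"), "1 minute"))
  .agg((count("user_id") / count(lit(1))).as("user_id_completeness"))

completeness.writeStream.outputMode("update").format("console").start()

Timeliness could be approximated in a similar way, for example by aggregating the gap between the Kafka ingestion timestamp and an event-time field carried in the payload.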
We are trying to create a cluster which consists of 4 machines. The cluster will
be used by multiple users. How can we configure it so that users can submit jobs
from their personal computers, and is there any free tool you can suggest to
support this procedure?
--
Uğur Sopaoğlu
Severity: Medium
Vendor: The Apache Software Foundation
Versions Affected:
Spark versions through 2.1.2
Spark 2.2.0 through 2.2.1
Spark 2.3.0
Description:
In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, it's
possible for a malicious user to construct a URL pointing
Severity: High
Vendor: The Apache Software Foundation
Versions affected:
Spark versions through 2.1.2
Spark 2.2.0 to 2.2.1
Spark 2.3.0
Description:
In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, when
using PySpark or SparkR, it's possible for a different local user
We are happy to announce the availability of Spark 2.2.2!
Apache Spark 2.2.2 is a maintenance release, based on the branch-2.2
maintenance branch of Spark. We strongly recommend all 2.2.x users to upgrade
to this stable release. The release notes are available at
http://spark.apache.org
We are happy to announce the availability of Spark 2.1.3!
Apache Spark 2.1.3 is a maintenance release, based on the branch-2.1
maintenance branch of Spark. We strongly recommend all 2.1.x users to
upgrade to this stable release. The release notes are available at
http://spark.apache.org/releases
with these
data strings.
I’m trying to understand if Apache Spark can fit my use case. The only input
data will be these strings from this file. Can I correlate these events and
how? Is there a GUI to do it?
Any hints and advice will be appreciated.
Best regards,
Simone
We are happy to announce the availability of Spark 2.3.1!
Apache Spark 2.3.1 is a maintenance release, based on the branch-2.3
maintenance branch of Spark. We strongly recommend all 2.3.x users to
upgrade to this stable release.
To download Spark 2.3.1, head over to the download page:
http://spark.apache.org/downloads.html
though after the
> checkpoint dir is deleted ,
>
> I don't know how spark do this without checkpoint's metadata.
>
>
I met the same issue and I tried to delete the checkpoint dir before the
job,
but Spark still seems to read the correct offset even after the
checkpoint dir is deleted.
I don't know how Spark does this without the checkpoint's metadata.
You probably want to recognize "spark-shell" as a command in your
environment. Maybe try "sudo ln -s /path/to/spark-shell
/usr/bin/spark-shell" Have you tried "./spark-shell" in the current path
to see if it works?
Thank You,
Irving Duran
On Thu, May 31, 2018 at 9:00 AM Remil Mohanan wrote:
hadoopuser@sherin-VirtualBox:/usr/lib/spark/bin$ spark-shell
spark-shell: command not found
hadoopuser@sherin-VirtualBox:/usr/lib/spark/bin$ Spark.odt
<http://apache-spark-user-list.1001560.n3.nabble.com/file/t9314/Spark.odt>
Hi,
We use Apache Spark 2.2.0 in our stack. Our software, like other software, gets
installed by default under "C:\Program Files\". We have a
restriction that we cannot ask our customers to enable short names on their
machines. From our experience, spark does not handle the absolute
- some manipulation happening here and finally return an array of rows
return res[Row]
}
Could someone please help me with what is causing the issue here? I have tested
that the import spark.implicits is not working. How do I fix this error, or else help
me with a different approach here?
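For what it's worth, one common cause of "import spark.implicits not working" is importing from something other than a stable SparkSession value. A minimal sketch (names are made up):

import org.apache.spark.sql.SparkSession

object ImplicitsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("implicits-sketch").master("local[*]").getOrCreate()
    import spark.implicits._   // must be imported from the val, in the scope where encoders are needed

    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")   // toDF comes from the implicits
    df.show()
  }
}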
Hi:
I am using Apache Spark Structured Streaming (2.2.1) to implement custom
sessionization for events. The processing is in two steps:
1. flatMapGroupsWithState (based on user id) - which stores the state of the user and
emits events every minute until an expire event is received
2. The next step
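A rough sketch of step 1 (the types, fields and expiry logic are assumptions, not the poster's code), using flatMapGroupsWithState keyed by user id:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(userId: String, timestamp: java.sql.Timestamp, isExpire: Boolean)
case class SessionState(count: Long)
case class SessionUpdate(userId: String, count: Long, expired: Boolean)

def updateSessions(userId: String,
                   events: Iterator[Event],
                   state: GroupState[SessionState]): Iterator[SessionUpdate] = {
  val old = state.getOption.getOrElse(SessionState(0L))
  val batch = events.toSeq
  val updated = SessionState(old.count + batch.size)
  if (batch.exists(_.isExpire)) {
    state.remove()                      // close the session when an expire event arrives
    Iterator(SessionUpdate(userId, updated.count, expired = true))
  } else {
    state.update(updated)               // otherwise keep accumulating and emit an update
    Iterator(SessionUpdate(userId, updated.count, expired = false))
  }
}

// Given events: Dataset[Event] from the streaming source (with spark.implicits._ in scope):
// val updates = events.groupByKey(_.userId)
//   .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.NoTimeout)(updateSessions)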
Hi:
I am using spark structured streaming 2.2.1 and am using flatMapGroupsWithState
and a groupBy count operator.
In the StreamExecution logs I see two entries for stateOperators
"stateOperators" : [ {
"numRowsTotal" : 1617339,
"numRowsUpdated" : 9647
}, {
"numRowsTotal" :
Hi:
I am working on spark structured streaming (2.2.1) with kafka and want 100
executors to be alive. I set spark.executor.instances to be 100. The process
starts running with 100 executors but after some time only a few remain which
causes backlog of events from kafka.
I thought I saw a
Hi:
I am working on a realtime application using spark structured streaming (v
2.2.1). The application reads data from kafka and if there is a failure, I
would like to ignore the checkpoint. Is there any configuration to just read
from last kafka offset after a failure and ignore any offset
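Not an authoritative answer, just a sketch of the relevant knobs (broker, topic and paths are made up): startingOffsets controls where a brand-new query begins, and once a checkpoint exists the offsets come from it, so starting "fresh" generally means pointing the query at a new checkpointLocation.

val stream = spark.readStream        // assuming an existing SparkSession named spark
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "latest")       // or "earliest"; applies only when no checkpoint exists
  .load()

val query = stream.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/restart-fresh")   // new location => old state ignored
  .start()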
> version etc, please let me know.
>
> Thanks
>
> Here is the exception stack trace.
>
> java.util.concurrent.TimeoutException: Cannot fetch record for offset
> <offset#> in 120000 milliseconds
> at org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$
>
...milliseconds at
org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:219)
at
org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:117)
at
org.apache.spark.sql.ka
Please look at the UI if not already; it can provide a lot of information.
Hi:
I am working with spark structured streaming (2.2.1) reading data from Kafka
(0.11).
I need to aggregate data ingested every minute and I am using spark-shell at
the moment. The message ingestion rate is approx 500k/second. During
some trigger intervals (1 minute) especially when
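For context, a rough sketch of the kind of per-minute aggregation described (topic, broker and watermark are assumptions), with an explicit one-minute trigger:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

val input = spark.readStream          // assuming an existing SparkSession named spark
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "metrics")
  .load()
  .selectExpr("CAST(value AS STRING) AS value", "timestamp")

val perMinute = input
  .withWatermark("timestamp", "2 minutes")
  .groupBy(window(col("timestamp"), "1 minute"))
  .count()

perMinute.writeStream
  .outputMode("update")
  .format("console")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()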
Thanks Richard. I am hoping that the Spark team will at some point provide more
detailed documentation.
On Sunday, February 11, 2018 2:17 AM, Richard Qiao
wrote:
Can't find a good source for documents, but the source code
Can't find a good source for documents, but the source code
“org.apache.spark.sql.execution.streaming.ProgressReporter” is helpful to
answer some of them.
For example:
inputRowsPerSecond = numRecords / inputTimeSec,
processedRowsPerSecond = numRecords / processingTimeSec
This is explaining
Just checking if anyone has any pointers for dynamically updating query state
in structured streaming.
Thanks
On Thursday, February 8, 2018 2:58 PM, M Singh
wrote:
Hi Spark Experts:
I am trying to use a stateful udf with spark structured streaming that
Hi:
I am working with spark 2.2.0 and am looking at the query status console
output.
My application reads from kafka - performs flatMapGroupsWithState and then
aggregates the elements for two group counts. The output is sent to the console
sink. I see the following output (with my questions
Hi Spark Experts:
I am trying to use a stateful udf with spark structured streaming that needs to
update the state periodically.
Here is the scenario:
1. I have a udf with a variable with a default value (e.g. 1). This value is
applied to a column (e.g. subtract the variable from the column value
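As a sketch of the scenario (the column name is an assumption): a udf closure captures the variable when it is serialized to the executors, so later driver-side updates are not automatically visible there, which is presumably the difficulty being asked about.

import org.apache.spark.sql.functions.{col, udf}

var offset = 1                                   // the "variable with default value"
val subtractOffset = udf((value: Int) => value - offset)

// df.withColumn("adjusted", subtractOffset(col("value")))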
Free access to Index Conf for Apache Spark session attendees. For info go to:
https://www.meetup.com/SF-Big-Analytic
IBM is hosting a developer conference - Essentially the conference is ‘By
Developers, for Developers’ based on Open technologies.
This will be held Feb 20 - 22nd in Moscone West
Hi Jacek:
Thanks for your response.
I am just trying to understand the fundamentals of watermarking and how it
behaves in aggregation vs non-aggregation scenarios.
On Tuesday, February 6, 2018 9:04 AM, Jacek Laskowski
wrote:
Hi,
What would you expect? The data is
Hi,
What would you expect? The data is simply dropped as that's the purpose of
watermarking it. That's my understanding at least.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming
Just checking if anyone has more details on how watermark works in cases where
event time is earlier than processing time stamp.
On Friday, February 2, 2018 8:47 AM, M Singh wrote:
Hi Vishu/Jacek:
Thanks for your responses.
Jacek - At the moment, the current time
(TraversableLike.scala:234) at
scala.collection.AbstractTraversable.map(Traversable.scala:104) at
org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:435) at
org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.sca
Hi Vishu/Jacek:
Thanks for your responses.
Jacek - At the moment, the current time for my use case is processing time.
Vishnu - Spark documentation
(https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
does indicate that it can dedup using watermark. So I believe
5) at
org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:157)
at
org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala
Could you give the full stack trace of the exception?
Also, can you do `dataframe2.explain(true)` and show us the plan output?
On Wed, Jan 31, 2018 at 3:35 PM, M Singh
wrote:
> Hi Folks:
>
> I have to add a column to a structured *streaming* dataframe but when I
Hi Folks:
I have to add a column to a structured streaming dataframe but when I do that
(using select or withColumn) I get an exception. I can add a column to a
non-streaming structured dataframe. I could not find any
documentation on how to do this in the following doc
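For comparison, a minimal sketch (using the built-in rate test source, not the poster's source) where adding derived columns to a streaming DataFrame with withColumn works as expected:

import org.apache.spark.sql.functions._

val streamDf = spark.readStream       // assuming an existing SparkSession named spark
  .format("rate")                     // test source generating (timestamp, value) rows
  .option("rowsPerSecond", "10")
  .load()

val withExtra = streamDf
  .withColumn("doubled", col("value") * 2)
  .withColumn("source", lit("rate"))

withExtra.writeStream.format("console").start()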
Hi Mans,
Watermark in Spark is used to decide when to clear the state, so if the
event is delayed beyond the point when the state is cleared by Spark, then it will
be ignored.
I recently wrote a blog post on this :
http://vishnuviswanath.com/spark_structured_streaming.html#watermark
Yes, this State is
Hi, Nicolas.
Yes. In Apache Spark 2.3, there are new sub-improvements for SPARK-20901
(Feature parity for ORC with Parquet).
For your questions, the following three are related.
1. spark.sql.orc.impl="native"
By default, `native` ORC implementation (based on the latest ORC 1.4.1
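A small sketch of turning the new reader on explicitly (Spark 2.3+); the values shown are the ones under discussion, not necessarily defaults in every release, and the path is a placeholder.

spark.conf.set("spark.sql.orc.impl", "native")                   // use the new ORC 1.4-based reader
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")   // vectorized reads for ORC

val df1 = spark.read.format("orc").load("/path/to/orc")
val df2 = spark.sql("select * from my_orc_table_in_hive")
// for Hive tables, spark.sql.hive.convertMetastoreOrc also affects which reader is used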
Hi
Thanks for this work.
Will this affect both:
1) spark.read.format("orc").load("...")
2) spark.sql("select ... from my_orc_table_in_hive")
?
On Jan 10, 2018 at 20:14, Dongjoon Hyun wrote:
> Hi, All.
>
> Vectorized ORC Reader is now suppor
Hi,
I'm curious how would you do the requirement "by a certain amount of time"
without a watermark? How would you know what's current and compute the lag?
Let's forget about watermark for a moment and see if it pops up as an
inevitable feature :)
"I am trying to filter out records which are
Hi:
I am trying to filter out records which are lagging behind (based on event
time) by a certain amount of time.
Is the watermark api applicable to this scenario (ie, filtering lagging
records) or it is only applicable with aggregation ? I could not get a clear
understanding from the
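Not an answer from the thread, but a minimal sketch of filtering lagging records with a plain predicate rather than the watermark API (the events DataFrame, the event_time column and the 10-minute threshold are assumed names):

import org.apache.spark.sql.functions.expr

// Keep only records whose event_time is within 10 minutes of the processing-time clock.
val filtered = events.where(expr("event_time >= current_timestamp() - INTERVAL 10 MINUTES"))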
Thanks TD. When is 2.3 scheduled for release?
On Thursday, January 25, 2018 11:32 PM, Tathagata Das
wrote:
Hello Mans,
The streaming DataSource APIs are still evolving and are not public yet. Hence
there is no official documentation. In fact, there is a new
Hello Mans,
The streaming DataSource APIs are still evolving and are not public yet.
Hence there is no official documentation. In fact, there is a new
DataSourceV2 API (in Spark 2.3) that we are migrating towards. So at this
point of time, it's hard to make any concrete suggestion. You can take a
Hi:
I am trying to create a custom structured streaming source and would like to
know if there is any example or documentation on the steps involved.
I've looked at the some methods available in the SparkSession but these are
internal to the sql package:
private[sql] def
Jacek Laskowski on this mailing list wrote a book which is available
online.
Hth
On Jan 18, 2018 6:16 AM, "Manuel Sopena Ballesteros" <
manuel...@garvan.org.au> wrote:
> Dear Spark community,
>
>
>
> I would like to learn more about apache spark. I have a Horto
Dear Spark community,
I would like to learn more about Apache Spark. I have a Hortonworks HDP
platform and have run a few Spark jobs in a cluster, but now I need to know more
in depth how Spark works.
My main interest is the sys admin and operational side of Spark and its ecosystem
Hi, All.
Vectorized ORC Reader is now supported in Apache Spark 2.3.
https://issues.apache.org/jira/browse/SPARK-16060
It has been a long journey. From now, Spark can read ORC files faster
without feature penalty.
Thank you for all your support, especially Wenchen Fan.
It's done by two
Saisai Shao; Raj Adyanthaya; spark users
Subject: Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0
My current best guess is that Spark does not fully support Hadoop 3.x because
https://issues.apache.org/jira/browse/SPARK-18673 (updates to Hive shims for
Hadoop 3.x) has not been resolved. There
My current best guess is that Spark does *not* fully support Hadoop 3.x
because https://issues.apache.org/jira/browse/SPARK-18673 (updates to Hive
shims for Hadoop 3.x) has not been resolved. There are also likely to be
transitive dependency conflicts which will need to be resolved.
On Mon, Jan
Yes, the Spark download page does mention that 2.2.1 is for 'hadoop-2.7 and
later', but my confusion is because Spark was released on 1st Dec and
hadoop-3 stable version released on 13th Dec. And to my similar question
on stackoverflow.com
AFAIK, there's no large scale test for Hadoop 3.0 in the community. So it
is not clear whether it is supported or not (or has some issues). I think
in the download page "Pre-Built for Apache Hadoop 2.7 and later" mostly
means that it supports Hadoop 2.7+ (2.8...), but not 3.0 (IIUC).
Thanks
Jerry
Hi Akshay
On the Spark Download page when you select Spark 2.2.1 it gives you an
option to select package type. In that, there is an option to select
"Pre-Built for Apache Hadoop 2.7 and later". I am assuming it means that it
does support Hadoop 3.0.
http://spark.apache.org/downloads.html
hello Users,
I need to know whether we can run latest spark on latest hadoop version
i.e., spark-2.2.1 released on 1st dec and hadoop-3.0.0 released on 13th dec.
thanks.
Hi Jacek:
The javadoc mentions that we can only consume data from the data frame in the
addBatch method. So, if I would like to save the data to a new sink then I
believe that I will need to collect the data and then save it. This is the
reason I am asking about how to control the size of
Hi,
> If the data is very large then a collect may result in OOM.
That's a general case even in any part of Spark, incl. Spark Structured
Streaming. Why would you collect in addBatch? It's on the driver side and
as anything on the driver, it's a single JVM (and usually not fault
tolerant)
> Do