Hi Spark Users,
Is there any Job/Career channel for Apache Spark?
Thank you
Hi Dear Spark Users,
It has been many years since I last worked on Spark, so please help me. Thanks
much
I have different cities and their coordinates in a DataFrame[Row]. I want to
find the distance in KMs and then show only those records / cities which are 10
KMs apart.
I have a function created that can
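A minimal sketch of one common way to do this, assuming the DataFrame is called citiesDF with columns city, lat and lon (hypothetical names) and using a Haversine UDF over a self cross join; flip the final filter if "10 KMs far" means farther than 10 KM rather than within it:

import org.apache.spark.sql.functions._

// Haversine great-circle distance in kilometres between two lat/lon points.
val haversineKm = udf { (lat1: Double, lon1: Double, lat2: Double, lon2: Double) =>
  val r = 6371.0 // Earth radius in km
  val dLat = math.toRadians(lat2 - lat1)
  val dLon = math.toRadians(lon2 - lon1)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) * math.pow(math.sin(dLon / 2), 2)
  2 * r * math.asin(math.sqrt(a))
}

// citiesDF(city, lat, lon) is assumed; pair each city with every other city.
val a = citiesDF.select(col("city").as("city_a"), col("lat").as("lat_a"), col("lon").as("lon_a"))
val b = citiesDF.select(col("city").as("city_b"), col("lat").as("lat_b"), col("lon").as("lon_b"))
val within10Km = a.crossJoin(b)
  .filter(col("city_a") =!= col("city_b"))
  .withColumn("distance_km", haversineKm(col("lat_a"), col("lon_a"), col("lat_b"), col("lon_b")))
  .filter(col("distance_km") <= 10)
within10Km.show(false)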
Hi Everyone, I need help with my Airflow DAG which has a Spark submit. I now
have a Kubernetes cluster instead of the Hortonworks Linux distributed Spark cluster.
My existing spark-submit is through a BashOperator as below:
calculation1 = '/usr/hdp/2.6.5.0-292/spark2/bin/spark-submit --conf
I am looking for any built-in API, if one exists.
On Tue, Jun 22, 2021 at 1:16 PM Chetan Khatri
wrote:
> this has been very slow
>
> On Tue, Jun 22, 2021 at 1:15 PM Sachit Murarka
> wrote:
>
>> Hi Chetan,
>>
>> You can subtract the data frame or use except()
>
> hope this helps
>
> Thanks
> Sachit
>
> On Tue, Jun 22, 2021, 22:23 Chetan Khatri
> wrote:
>
>> Hi Spark Users,
>>
>> I want to use DropDuplicate, but those records which I discard. I
>> would like to log to the instrumental table.
>>
>> What would be the best approach to do that?
>>
>> Thanks
>>
>
Hi Spark Users,
I want to use dropDuplicates, but I would like to log the records which I
discard to the instrumental table.
What would be the best approach to do that?
Thanks
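A minimal sketch of the approach suggested in the reply above, assuming a hypothetical key column list and audit path: keep the result of dropDuplicates, compute the discarded rows as the difference, and append them to the instrumentation location. Note that except() is a set operation, so rows that are exact full-row duplicates of a kept row will not show up in the difference.

val keyCols = Seq("id")                     // hypothetical duplicate key
val deduped = inputDF.dropDuplicates(keyCols)
val discarded = inputDF.except(deduped)     // rows removed by dropDuplicates

// Log the discarded records to an instrumentation table / path (path is an assumption).
discarded.write.mode("append").parquet("/audit/dropped_duplicates")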
May 5, 2021 at 10:15 PM Chetan Khatri
wrote:
> Hi All, Collect in spark is taking huge time. I want to get list of values
> of one column to Scala collection. How can I do this?
> val newDynamicFieldTablesDF = cachedPhoenixAppMetaDataForCreateTableDF
> .select(col("
Hi All, Collect in Spark is taking a huge amount of time. I want to get the list of values
of one column into a Scala collection. How can I do this?
val newDynamicFieldTablesDF = cachedPhoenixAppMetaDataForCreateTableDF
.select(col("reporting_table")).except(clientSchemaDF)
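A minimal sketch of pulling a single column into a Scala collection; collect() always brings data to the driver, so the usual way to make it cheaper is to select only the needed column before collecting. Names follow the snippet above.

import org.apache.spark.sql.functions.col
import spark.implicits._

// Select just the needed column, then collect it as a typed Scala collection.
val reportingTables: Seq[String] = newDynamicFieldTablesDF
  .select(col("reporting_table"))
  .as[String]
  .collect()
  .toSeq

// For very large results, toLocalIterator() streams partitions to the driver
// one at a time instead of materialising everything at once.
val iter = newDynamicFieldTablesDF.select(col("reporting_table")).as[String].toLocalIterator()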
Hi Users,
What is the equivalent of Pandas *df.dropna(axis='columns')* in
Spark/Scala?
Thanks
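As far as I know there is no single built-in equivalent of df.dropna(axis='columns') in Spark; a common workaround is to count nulls per column and drop every column that contains any. A minimal sketch, assuming df is the input DataFrame:

import org.apache.spark.sql.functions._

// Count nulls for every column in one pass.
val nullCounts = df.select(
  df.columns.map(c => count(when(col(c).isNull, c)).alias(c)): _*
).collect()(0)

// Columns having at least one null, then drop them (mirrors dropna(axis='columns')).
val colsWithNulls = df.columns.filter(c => nullCounts.getAs[Long](c) > 0)
val dfWithoutNullCols = df.drop(colsWithNulls: _*)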
Thanks, you meant in a for loop. Could you please put pseudocode in Spark?
On Fri, Jun 19, 2020 at 8:39 AM Jörn Franke wrote:
> Make every JSON object a line and then read it as JSON Lines, not as multiline
>
> On 19.06.2020 at 14:37, Chetan Khatri wrote:
>
>
> All
All transactions are in JSON; it is not a single array.
On Thu, Jun 18, 2020 at 12:55 PM Stephan Wehner
wrote:
> It's an interesting problem. What is the structure of the file? One big
> array? One hash with many key-value pairs?
>
> Stephan
>
> On Thu, Jun 18, 2020 at 6:12 AM Chet
Yes
On Thu, Jun 18, 2020 at 12:34 PM Gourav Sengupta
wrote:
> Hi,
> So you have a single JSON record in multiple lines?
> And all the 50 GB is in one file?
>
> Regards,
> Gourav
>
> On Thu, 18 Jun 2020, 14:34 Chetan Khatri,
> wrote:
>
>> It is dynamicall
> a role.
>
> Otherwise if it is one large object or array I would not recommend it.
>
> > On 18.06.2020 at 15:12, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
> >
> >
> > Hi Spark Users,
> >
> > I have a 50GB of JSON file, I wou
You can use your executors to
> perform the reading instead of the driver.
>
> On Thu, Jun 18, 2020 at 9:12 AM Chetan Khatri
> wrote:
>
>> Hi Spark Users,
>>
>> I have a 50GB of JSON file, I would like to read and persist at HDFS so
>> it can be tak
Hi Spark Users,
I have a 50GB JSON file which I would like to read and persist to HDFS so it
can be taken into the next transformation. I am trying to read it as
spark.read.json(path) but this is giving an out-of-memory error on the driver.
Obviously, I can't afford having 50 GB of driver memory. In general,
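The usual advice for this case (echoed in the replies above) is to avoid multiline JSON: if the file can be rewritten as JSON Lines (one object per line), spark.read.json splits it across executors instead of loading it on the driver. A minimal sketch, with the paths as assumptions:

// If the 50GB file is one JSON object per line, this read is distributed
// across executors and never needs 50GB on the driver.
val df = spark.read.json("hdfs:///data/raw/transactions.jsonl")

// Supplying the schema up front also avoids the extra pass Spark does to infer it.
// val df = spark.read.schema(mySchema).json("hdfs:///data/raw/transactions.jsonl")

// Persist to HDFS for the next transformation stage.
df.write.mode("overwrite").parquet("hdfs:///data/staged/transactions")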
Hi Spark Users,
How can I provide the join ON condition at run time, in the form of a String, to
the code? Can someone please help me.
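One way this is commonly handled is to build the ON condition as a SQL string and pass it through expr(); a minimal sketch with hypothetical DataFrame and column names:

import org.apache.spark.sql.functions.expr

// Join condition arrives at run time as a plain string.
val joinCondition: String = "orders.customer_id = customers.id AND orders.region = customers.region"

val joined = ordersDF.as("orders")
  .join(customersDF.as("customers"), expr(joinCondition), "inner")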
tor slots. But even then, you're
> welcome to, say, use thread pools to execute even more concurrently as
> most are I/O bound. Your code can do what you want.
>
> On Thu, May 14, 2020 at 6:14 PM Chetan Khatri
> wrote:
> >
> > Thanks, that means number of executor = numb
ber of executor slots. Yes you can only
> simultaneously execute as many tasks as slots regardless of partitions.
>
> On Thu, May 14, 2020, 5:19 PM Chetan Khatri
> wrote:
>
>> Thanks Sean, Jerry.
>>
>> Default Spark DataFrame partitions are 200 right? does it have
>>
you do this within the context of an operation that is
> already parallelized such as a map, the work will be distributed to
> executors and they will do it in parallel. I could be wrong about this as I
> never investigated this specific use case, though.
> >
> > On Thu, May 1
>
> On Thu, May 14, 2020 at 5:03 PM Chetan Khatri
> wrote:
>
>> Hi Spark Users,
>>
>> How can I invoke the Rest API call from Spark Code which is not only
>> running on Spark Driver but distributed / parallel?
>>
>> Spark with Scala is my tech stack.
>>
>> Thanks
>>
>>
>>
>
> --
> http://www.google.com/profiles/grapesmoker
>
Hi Spark Users,
How can I invoke a REST API call from Spark code such that it is not only
running on the Spark driver but distributed / parallel?
Spark with Scala is my tech stack.
Thanks
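A minimal sketch of the pattern suggested in the replies: do the HTTP calls inside mapPartitions so they run on executors. The endpoint and the input Dataset (inputDS of strings) are hypothetical; only the JDK's HttpURLConnection is used to keep it dependency-free.

import java.net.{HttpURLConnection, URL}
import scala.io.Source
import spark.implicits._

// Each partition is processed on an executor, so the REST calls are distributed.
val responses = inputDS.mapPartitions { rows =>
  rows.map { id =>
    val url = new URL(s"https://api.example.com/items/$id")   // hypothetical endpoint
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("GET")
    val body = Source.fromInputStream(conn.getInputStream).mkString
    conn.disconnect()
    (id, body)
  }
}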
x.html#xpath>, there
> appears to be no reason to expect that to work.
>
> On Tue, May 12, 2020 at 2:09 PM Chetan Khatri
> wrote:
>
>> Can someone please help.. Thanks in advance.
>>
>> On Mon, May 11, 2020 at 5:29 PM Chetan Khatri <
>> chetan.opensou..
che.org/docs/2.4.5/api/sql/index.html#xpath>, there
> appears to be no reason to expect that to work.
>
> On Tue, May 12, 2020 at 2:09 PM Chetan Khatri
> wrote:
>
>> Can someone please help.. Thanks in advance.
>>
>> On Mon, May 11, 2020 at 5:29 PM Chetan Khatri <
Can someone please help? Thanks in advance.
On Mon, May 11, 2020 at 5:29 PM Chetan Khatri
wrote:
> Hi Spark Users,
>
> I want to parse xml coming in the query columns and get the value, I am
> using *xpath_int* which works as per my requirement but When I am
> embedding in the
Hi Spark Users,
I want to parse XML coming in the query columns and get the value. I am
using *xpath_int*, which works as per my requirement, but when I am embedding it
in the Spark SQL query columns it is failing.
select timesheet_profile_id,
*xpath_int(timesheet_profile_code,
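For reference, a minimal sketch of xpath_int embedded in a Spark SQL query, assuming a column timesheet_profile_code that holds XML text; the table name and XPath path are hypothetical:

spark.sql("""
  SELECT timesheet_profile_id,
         xpath_int(timesheet_profile_code, '/profile/hoursPerWeek/text()') AS hours_per_week
  FROM timesheet_profiles
""").show(false)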
;
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical conten
Hi Spark Users,
I've a Spark job where I am reading a parquet path, and that parquet path's
data is generated by other systems; some of the parquet paths don't
contain any data, which is possible. Is there any way to read the parquet so that
if no data is found I can create a dummy dataframe and go
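A minimal sketch of one way to fall back to an empty/dummy DataFrame when a parquet path has no data, assuming the expected schema is known; the path check uses the Hadoop FileSystem API and the schema shown is a placeholder.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._

// Expected schema is an assumption; adjust to the real one.
val expectedSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)
))

def readParquetOrEmpty(path: String): DataFrame = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val hasData = fs.exists(new Path(path)) && fs.listStatus(new Path(path)).nonEmpty
  if (hasData) spark.read.parquet(path)
  else spark.createDataFrame(spark.sparkContext.emptyRDD[org.apache.spark.sql.Row], expectedSchema)
}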
> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>
>
>
> Le mar. 28 avr. 2020 à 23:18, Chetan Khatri
> a écrit :
>
>> Hi Spark Users,
>>
>> My spark job gave me an error No Space left on the device
>>
>
> to drop me a mail or raise an issue here:
> https://github.com/qubole/spark-acid/issues
>
> Regards,
> Amogh
>
> On Tue, Mar 10, 2020 at 4:20 AM Chetan Khatri
> wrote:
>
>> Hi Venkata,
>> Thanks for your reply. I am using HDP 2.6 and I don't think above wi
> through strings, you should format them as above. If you use Dataset.map
> you can access the timestamps as java.sql.Timestamp objects (but that might
> not be necessary):
>
> import java.sql.Timestamp
> case class Times(value: Timestamp)
> timestampDF.as[Times].map(t => t.va
Hi Spark Users,
My Spark job gave me an error: No space left on device
ng (non
>> ISO8601) strings and showing timestamps.
>>
>> br,
>>
>> Magnus
>>
>> On Tue, Mar 31, 2020 at 6:14 PM Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Hi Spark Users,
>>>
>>> I am losi
Hi Spark Users,
I am losing the timezone value from the below format. I tried a couple of formats
but was not able to make it work. Can someone throw some light?
scala> val sampleDF = Seq("2020-04-11T20:40:00-0500").toDF("value")
sampleDF: org.apache.spark.sql.DataFrame = [value: string]
scala>
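A minimal sketch of parsing that value with an explicit pattern (the Z pattern letter matches a -0500 style offset). Note Spark normalizes timestamps to the session time zone internally, so if the original offset text itself must be preserved, keep it in a separate string column (e.g. via regexp_extract).

import org.apache.spark.sql.functions._

val parsedDF = sampleDF.withColumn(
  "ts",
  to_timestamp(col("value"), "yyyy-MM-dd'T'HH:mm:ssZ")   // Z handles the -0500 offset
)
parsedDF.show(false)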
"),
(1, 0, "miki", "NUM IS NOT NULL AND FLAG!=0 AND WORD == 'MIKI'")
).toDF("num", "flag", "word", "expression")
val derivedDF = sampleDF.withColumn("status",
expr(sampleDF.col("expression").as[String].toString())
Hi Spark Users,
I want to evaluate an expression from a dataframe column's values on the other columns
in the same dataframe, for each row. Please suggest the best approach to deal
with this, given that it should not impact the performance of the job.
Thanks
Sample code:
val sampleDF = Seq(
(8, 1, "bat", "NUM IS
cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
>
> On Thu, Mar 5, 2020, 6:51 AM Chetan Khatri
> wrote:
>
>> Just followup, if anyone has worried on this before
>>
>> On Wed, Mar 4, 2020 at
Just a follow-up, in case anyone has worked on this before
On Wed, Mar 4, 2020 at 12:09 PM Chetan Khatri
wrote:
> Hi Spark Users,
> I want to read Hive ACID managed table data (ORC) in Spark. Can someone
> help me here.
> I've tried, https://github.com/qubole/spark-acid but no success.
>
> Thanks
>
Hi Spark Users,
I want to read Hive ACID managed table data (ORC) in Spark. Can someone
help me here.
I've tried, https://github.com/qubole/spark-acid but no success.
Thanks
(ds.columns.map(col) ++ ds.columns.map(column =>
> md5(col(column)).as(s"$column hash")): _*).show(false)
>
> Enrico
>
> On 02.03.20 at 11:10, Chetan Khatri wrote:
>
> Thanks Enrico
> I want to compute a hash of all the column values in the row.
>
st with this Dataset ds:
>
> import org.apache.spark.sql.types._
> val ds = spark.range(10).select($"id".cast(StringType))
>
> Available are md5, sha, sha1, sha2 and hash:
> https://spark.apache.org/docs/2.4.5/api/sql/index.html
>
> Enrico
>
>
> On 28.02.20 at 13:56
Hi Spark Users,
How can I compute a hash of each row and store it in a new column of the Dataframe?
Could someone help me.
Thanks
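A minimal sketch following the built-ins mentioned in the replies (hash, md5, sha2), computed over every column of the row and stored in a new column; df is the input DataFrame (an assumption):

import org.apache.spark.sql.functions._

// Numeric 32-bit hash over every column of the row.
val withHash = df.withColumn("row_hash", hash(df.columns.map(col): _*))

// Or a hex digest: concatenate the columns and apply sha2 (or md5).
val withDigest = df.withColumn(
  "row_sha2",
  sha2(concat_ws("||", df.columns.map(col): _*), 256)
)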
Hi Spark Users,
Thank you for all the support over the mailing list. Contributors - thanks for
all your contributions.
This is my first 5 mins talk with Apache Spark -
https://youtu.be/bBqItpgT8xQ
Thanks.
Thanks. If you can share an alternative design change, I would love to hear
from you.
On Wed, Dec 11, 2019 at 9:34 PM ayan guha wrote:
> No we faced problem with that setup.
>
> On Thu, 12 Dec 2019 at 11:14 am, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>
Hi Spark Users,
Would it be possible to write to the same partition of a parquet file
through two concurrent Spark jobs with different Spark sessions?
Thanks
Oct 2019 at 11:02 AM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>> Could someone please help me.
>>
>> On Thu, Oct 17, 2019 at 7:29 PM Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Hi Users,
>>>
>
Thanks Jörn
On Sun, Oct 27, 2019 at 8:01 AM Jörn Franke wrote:
> Use yarn queues:
>
>
> https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
>
> On 27.10.2019 at 06:41, Chetan Khatri wrote:
>
>
> Could someone pleas
Could someone please help me to understand better..
On Thu, Oct 17, 2019 at 7:41 PM Chetan Khatri
wrote:
> Hi Users,
>
> I do submit *X* number of jobs with Airflow to Yarn as a part of workflow
> for *Y *customer. I could potentially run workflow for customer *Z *but I
> need to
Could someone please help me.
On Thu, Oct 17, 2019 at 7:29 PM Chetan Khatri
wrote:
> Hi Users,
>
> I am setting spark configuration in below way;
>
> val spark = SparkSession.builder().appName(APP_NAME).getOrCreate()
>
> spark.conf.set("spark.speculation"
Hi Users,
I submit *X* number of jobs with Airflow to Yarn as part of a workflow
for customer *Y*. I could potentially run the workflow for customer *Z*, but I
need to check how many resources are available on the cluster so that
jobs for the next customer can start.
Could you please tell what is
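For checking free capacity before kicking off the next customer's jobs, one option is the YARN ResourceManager REST API; a minimal sketch, where the ResourceManager host/port is an assumption and JSON parsing is left to whichever library is already on the classpath:

import scala.io.Source

// Hypothetical ResourceManager address; /ws/v1/cluster/metrics returns cluster-wide metrics.
val rmUrl = "http://resourcemanager-host:8088/ws/v1/cluster/metrics"
val metricsJson = Source.fromURL(rmUrl).mkString

// metricsJson contains fields such as availableMB and availableVirtualCores,
// which can be parsed to decide whether to start the next customer's workflow.
println(metricsJson)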
Shyam, as Mark said - if we boost the parallelism with Spark we can reach
the performance of Sqoop or better.
On Tue, Sep 3, 2019 at 6:35 PM Shyam P wrote:
> J Franke,
> Leave alone sqoop , I am just asking about spark in ETL of Oracle ...?
>
> Thanks,
> Shyam
>
>>
Hi Users,
I am setting the Spark configuration in the below way:
val spark = SparkSession.builder().appName(APP_NAME).getOrCreate()
spark.conf.set("spark.speculation", "false")
spark.conf.set("spark.broadcast.compress", "true")
spark.conf.set("spark.sql.broadcastTimeout", "36000")
e
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 29 Aug 2019 at 21:01, Chetan Khatri
> wrote:
>
>> Hi Users,
ri, 30 Aug 2019 at 06:02, Chetan Khatri
> wrote:
>
>> Sorry,
>> I call sqoop job from above function. Can you help me to resolve this.
>>
>> Thanks
>>
>> On Fri, Aug 30, 2019 at 1:31 AM Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>
Sorry,
I call the Sqoop job from the above function. Can you help me to resolve this?
Thanks
On Fri, Aug 30, 2019 at 1:31 AM Chetan Khatri
wrote:
> Hi Users,
> I am launching a Sqoop job from Spark job and would like to FAIL Spark job
> if Sqoop job fails.
>
> def executeSqoopOrig
Hi Users,
I am launching a Sqoop job from a Spark job and would like to FAIL the Spark job
if the Sqoop job fails.
def executeSqoopOriginal(serverName: String, schemaName: String,
username: String, password: String,
query: String, splitBy: String, fetchSize: Int,
numMappers: Int,
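A minimal sketch of making the Spark job fail when the launched Sqoop job fails, assuming the job is launched through the Sqoop Java API rather than a shell call: Sqoop.runTool returns a non-zero exit code on failure, so the wrapper can check it and throw (the argument list is illustrative only).

import org.apache.hadoop.conf.Configuration
import org.apache.sqoop.Sqoop

def runSqoopImport(args: Array[String]): Unit = {
  // runTool returns 0 on success; anything else means the Sqoop job failed.
  val exitCode = Sqoop.runTool(args, new Configuration())
  if (exitCode != 0) {
    throw new RuntimeException(s"Sqoop job failed with exit code $exitCode, failing the Spark job")
  }
}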
path in HDFS. I would suggest to write the
> parquet files to a different path, perhaps to a project space or user home,
> rather than at the root directory.
>
> HTH,
> Deng
>
> On Sat, Jun 8, 2019 at 8:00 AM Chetan Khatri
> wrote:
>
>> Hello Dear Spark Users,
>
Also, does anyone have any idea how to resolve this issue -
https://stackoverflow.com/questions/56390492/spark-metadata-0-doesnt-exist-while-compacting-batch-9-structured-streaming-er
On Fri, Jun 7, 2019 at 5:59 PM Chetan Khatri
wrote:
> Hello Dear Spark Users,
>
> I am trying to write data f
Hello Dear Spark Users,
I am trying to write data from a Kafka topic to Parquet on HDFS with Structured
Streaming, but I am getting failures. Please do help.
val spark: SparkSession =
SparkSession.builder().appName("DemoSparkKafka").getOrCreate()
import spark.implicits._
val dataFromTopicDF = spark
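A minimal sketch of what the rest of such a Kafka-to-parquet pipeline usually looks like; the broker, topic and paths are assumptions, the spark-sql-kafka package must be on the classpath, and the checkpointLocation is required for the file sink (a missing or corrupted checkpoint is a frequent cause of the _spark_metadata errors mentioned above).

val streamDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // assumed broker
  .option("subscribe", "demo_topic")                  // assumed topic
  .option("startingOffsets", "earliest")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")

val query = streamDF.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/demo_topic_parquet")
  .option("checkpointLocation", "hdfs:///checkpoints/demo_topic_parquet")
  .outputMode("append")
  .start()

query.awaitTermination()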
quot;
> *Date: *Tuesday, 23 April 2019 at 11:35
> *To: *Chetan Khatri , Jason Nerothin <
> jasonnerot...@gmail.com>
> *Cc: *user
> *Subject: *Re: Update / Delete records in Parquet
>
>
>
> Hi Chetan,
>
>
>
> I also agree that for this usecase parquet would
s like it might be the wrong sink for a high-frequency change
> scenario.
>
> What are you trying to accomplish?
>
> Thanks,
> Jason
>
> On Mon, Apr 22, 2019 at 2:09 PM Chetan Khatri
> wrote:
>
>> Hello All,
>>
>> If I am doing incremental load / delta
Hello All,
If I am doing an incremental load / delta and would like to update / delete
records in parquet, I understand that parquet is immutable and can't
be deleted / updated; theoretically only append / overwrite can be done. But
I can see utility tools which claim to add value for that.
Hello Spark Users,
Someone has suggested breaking 5-5 unpredictable transformation blocks
into Future[ONE STRING ARGUMENT] and claims this can tune the performance. I
am wondering, is this a valid use of an explicit Future in Spark?
Sample code is below:
def writeData(tableName: String):
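For reference, a minimal sketch of what that Future-based pattern usually looks like: independent write actions wrapped in Futures so their Spark jobs are submitted concurrently. Table names and the write target are hypothetical; this only helps when the jobs are independent and the cluster has idle capacity.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

def writeData(tableName: String): Future[Unit] = Future {
  // Each Future triggers its own Spark job; they run concurrently on the cluster.
  spark.table(tableName).write.mode("overwrite").parquet(s"/warehouse/out/$tableName")
}

val jobs = Seq("table_a", "table_b", "table_c").map(writeData)
Await.result(Future.sequence(jobs), Duration.Inf)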
el("OFF")
>
>
> spark.table("").show(100,truncate=false)
>
> But is there any specific reason you want to write it to hdfs? Is this for
> human consumption?
>
> Regards,
> Nuthan
>
> On Sat, Apr 13, 2019 at 6:41 PM Chetan Khatri
> wrote:
>
&
Hello Users,
In Spark, when I have a DataFrame and do .show(100), I want to save the output
which gets printed, as-is, to a txt file in HDFS.
How can I do this?
Thanks
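One way to capture exactly what .show(100) prints is to redirect the console output into a string and write that string to HDFS; a minimal sketch (df and the output path are assumptions). If the tabular rendering isn't required, df.limit(100).write.csv(...) is simpler.

import java.io.{ByteArrayOutputStream, PrintStream}
import spark.implicits._

// Capture the console output of show(100) into a string.
val buffer = new ByteArrayOutputStream()
Console.withOut(new PrintStream(buffer)) {
  df.show(100, truncate = false)
}
val rendered = buffer.toString("UTF-8")

// Write the rendered text as a single file in HDFS (coalesce keeps it to one part file).
Seq(rendered).toDS.coalesce(1).write.mode("overwrite").text("hdfs:///tmp/df_show_output")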
>
>> How much memory do you have per partition?
>>
>> On Thu, Apr 4, 2019 at 7:49 AM Chetan Khatri
>> wrote:
>>
>>> I will get the information and will share with you.
>>>
>>> On Thu, Apr 4, 2019 at 5:03 PM Abdeali Kothari
>>&g
anything that is faster. When I ran it on my data ~8-9GB
> I think it took less than 5 mins (don't remember exact time)
>
> On Thu, Apr 4, 2019 at 1:09 PM Chetan Khatri
> wrote:
>
>> Thanks for awesome clarification / explanation.
>>
>> I have cases where update_time can be s
eems like it's meant for cases where you
> literally have redundant duplicated data. And not for filtering to get
> first/last etc.
>
>
> On Thu, Apr 4, 2019 at 11:46 AM Chetan Khatri
> wrote:
>
>> Hello Abdeali, Thank you for your response.
>>
>> Can you p
gt; The min() is faster than doing an orderBy() and a row_number().
> And the dropDuplicates at the end ensures records with two values for the
> same 'update_time' don't cause issues.
>
>
> On Thu, Apr 4, 2019 at 10:22 AM Chetan Khatri
> wrote:
>
>> Hello Dear Spark Users,
>>
>
Hello Dear Spark Users,
I am using dropDuplicates on a DataFrame generated from a large parquet file
(from HDFS), doing dropDuplicates based on a timestamp-based column; every
time I run it, it drops different rows for the same timestamp.
What I tried and what worked:
val wSpec =
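A minimal sketch of the window-function approach being referred to: rank rows within each key by the timestamp (plus a tiebreaker) and keep only the first, which makes the result deterministic unlike a plain dropDuplicates. The key and timestamp column names are assumptions.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// key_col / update_time are hypothetical; the extra orderBy column breaks ties
// on update_time the same way on every run.
val wSpec = Window.partitionBy("key_col").orderBy(col("update_time").desc, col("key_col"))

val dedupedDF = inputDF
  .withColumn("rn", row_number().over(wSpec))
  .filter(col("rn") === 1)
  .drop("rn")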
wrote:
> Hi , please tell me why you need to increase the time?
>
>
>
>
>
> At 2019-01-22 18:38:29, "Chetan Khatri"
> wrote:
>
> Hello Spark Users,
>
> Can you please tell me how to increase the time for Spark job to be in
> *Accept* mode in Yarn.
>
> Thank you. Regards,
> Chetan
>
>
>
>
>
Hello Spark Users,
Can you please tell me how to increase the time for a Spark job to stay in the
*Accepted* state in Yarn.
Thank you. Regards,
Chetan
> See also https://issues.apache.org/jira/browse/SPARK-10943.
>
> — Soumya
>
>
> On Nov 21, 2018, at 9:29 PM, Chetan Khatri
> wrote:
>
> Hello Spark Users,
>
> I have a Dataframe with some of Null Values, When I am writing to parquet
> it is failing with below error:
>
Hello Spark Users,
I have a Dataframe with some Null values. When I am writing it to parquet
it is failing with the below error:
Caused by: java.lang.RuntimeException: Unsupported data type NullType.
at scala.sys.package$.error(package.scala:27)
at
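The usual cause of this error is a column whose type is NullType (for example one built from a bare lit(null)), which the parquet writer cannot handle; casting such columns to a concrete type before writing avoids it. A minimal sketch, with df and the output path as assumptions:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Replace any NullType column with the same nulls cast to a concrete type.
val fixedDF = df.schema.fields.foldLeft(df) { (acc, field) =>
  if (field.dataType == NullType) acc.withColumn(field.name, col(field.name).cast(StringType))
  else acc
}
fixedDF.write.parquet("hdfs:///out/path")   // assumed output path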
Hello Spark Users,
I am working with Spark 2.3.0 on the HDP distribution, where my Spark job
completed successfully but the final job status is failed with the below error.
What is the best way to prevent this kind of error? Thanks
8/11/21 17:38:15 INFO ApplicationMaster: Final app status: SUCCEEDED,
Dear Spark Users,
I came across a little weird MSSQL query to replace with Spark and I have
no clue how to do it in an efficient way with Scala + Spark SQL. Can someone
please throw some light? I can create a view of the DataFrame and do it as
*spark.sql*(query),
but I would like to do it with Scala + Spark.
python.html
>
> We will continue adding more there.
>
> Feel free to ping me directly in case of questions.
>
> Thanks,
> Jayant
>
>
> On Mon, Jul 9, 2018 at 9:56 PM, Chetan Khatri > wrote:
>
>> Hello Jayant,
>>
>> Thank you so much for suggestion.
like Pandas Dataframe for processing and finally write the
> results back.
>
> In the Spark/Scala/Java code, you get an RDD of string, which we convert
> back to a Dataframe.
>
> Feel free to ping me directly in case of questions.
>
> Thanks,
> Jayant
>
>
> On Thu, Jul 5
Prem, sure. Thanks for the suggestion.
On Wed, Jul 4, 2018 at 8:38 PM, Prem Sure wrote:
> try .pipe(.py) on RDD
>
> Thanks,
> Prem
>
> On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri > wrote:
>
>> Can someone please suggest me , thanks
>>
>> On Tue 3 J
Can someone please offer a suggestion, thanks
On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri,
wrote:
> Hello Dear Spark User / Dev,
>
> I would like to pass Python user defined function to Spark Job developed
> using Scala and return value of that function would be returned to DF /
> Datas
Hello Dear Spark User / Dev,
I would like to pass a Python user-defined function to a Spark job developed
using Scala, and the return value of that function would be returned to the DF /
Dataset API.
Can someone please guide me on which would be the best approach to do this?
The Python function would be mostly
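Following the pipe() suggestion made later in this thread, a minimal sketch of calling a Python script from a Scala Spark job: each partition's rows are streamed to the script's stdin and its stdout becomes a new RDD, which can then be turned back into a DataFrame. The script path and the JSON record format are assumptions.

import spark.implicits._

// The Python script (shipped e.g. with --files) reads lines from stdin and writes results to stdout.
val inputRDD = inputDF.toJSON.rdd                       // one JSON string per row
val resultRDD = inputRDD.pipe("python transform.py")    // hypothetical script

// Back to a DataFrame by parsing the returned JSON lines.
val resultDF = spark.read.json(spark.createDataset(resultRDD))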
Can anyone throw light on this? It would be helpful.
On Tue, Jun 5, 2018 at 1:41 AM, Chetan Khatri
wrote:
> All,
>
> I would like to Apply Java Transformation UDF on DataFrame created from
> Table, Flat Files and retrun new Data Frame Object. Any suggestions, with
> respect to
All,
I would like to apply a Java transformation UDF on a DataFrame created from
a table or flat files, and return a new DataFrame object. Any suggestions, with
respect to Spark internals?
Thanks.
ct(): Dataset[T] = dropDuplicates()
>
> …
>
> def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
>
> …
>
> Aggregate(groupCols, aggCols, logicalPlan)
> }
>
>
>
>
>
>
>
>
>
> *From:* Chetan Khatri [mailto:chetan.opensou.
Georg, sorry for the dumb question. Help me to understand - if I do
DF.select(A,B,C,D).distinct(), that would be the same as the above groupBy without
agg in SQL, right?
On Wed, May 30, 2018 at 12:17 AM, Chetan Khatri wrote:
> I don't want to get any aggregation, just want to know rather saying
> di
8 AM, Georg Heiler > > wrote:
>>
>>> Why do you group if you do not want to aggregate?
>>> Isn't this the same as select distinct?
>>>
>>> Chetan Khatri wrote on Tue., 29 May
>>> 2018 at 20:21:
>>>
>>>> All,
>
e same as select distinct?
>
> Chetan Khatri wrote on Tue., 29 May 2018
> at 20:21:
>
>> All,
>>
>> I have scenario like this in MSSQL Server SQL where i need to do groupBy
>> without Agg function:
>>
>> Pseudocode:
>>
>>
>
All,
I have a scenario like this in MS SQL Server SQL where I need to do a groupBy
without an Agg function:
Pseudocode:
select m.student_id, m.student_name, m.student_std, m.student_group, m.student_dob
from student as m
inner join general_register g on m.student_id = g.student_id
group by
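As the replies say, a GROUP BY with no aggregates is just SELECT DISTINCT over those columns; a minimal Scala sketch of the query above, with the DataFrame names as assumptions:

import org.apache.spark.sql.functions.col

// Equivalent of the MSSQL GROUP BY without aggregates: distinct over the selected columns.
val result = studentDF.as("m")
  .join(generalRegisterDF.as("g"), col("m.student_id") === col("g.student_id"), "inner")
  .select("m.student_id", "m.student_name", "m.student_std", "m.student_group", "m.student_dob")
  .distinct()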
ave had this
>> issue in the past where all spark slaves tend to send lots of data at once
>> to SQL and that slows down the latency of the rest of the system. We
>> overcame this by using sqoop and running it in a controlled environment.
>>
>> On Wed, May 23, 2018 a
Super, just giving a high-level idea of what I want to do. I have one source
schema which is MS SQL Server 2008 and the target is also MS SQL Server 2008.
Currently there is a C#-based ETL application which does extract, transform
and load into a customer-specific schema, including indexing etc.
Thanks
On Wed,
his https://docs.microsoft.com/en-us/azure/sql-database/sql-
> database-spark-connector
>
>
>
>
>
> *From: *Chetan Khatri <chetan.opensou...@gmail.com>
> *Date: *Wednesday, May 23, 2018 at 7:47 AM
> *To: *user <user@spark.apache.org>
> *Subject: *Bul
All,
I am looking for an approach to do bulk read / write with MS SQL Server and
Apache Spark 2.2; please let me know if there is any library / driver for the same.
Thank you.
Chetan
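A minimal sketch of parallel JDBC read/write against MS SQL Server with the stock JDBC source, assuming the Microsoft SQL Server JDBC driver is on the classpath; hosts, credentials and table names are placeholders. The partitioning options give the "bulk" parallelism on read, while batchsize and the number of partitions control the write.

val jdbcUrl = "jdbc:sqlserver://dbhost:1433;databaseName=source_db"   // placeholder

// Parallel read: Spark issues one query per partition over the numeric split column.
val sourceDF = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", "dbo.big_table")
  .option("user", "spark_user").option("password", "***")
  .option("partitionColumn", "id")
  .option("lowerBound", "1").option("upperBound", "10000000")
  .option("numPartitions", "16")
  .load()

// Parallel write back to the target server, batching inserts.
sourceDF.write
  .format("jdbc")
  .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=target_db")
  .option("dbtable", "dbo.big_table_copy")
  .option("user", "spark_user").option("password", "***")
  .option("batchsize", "10000")
  .mode("append")
  .save()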
All,
I am running on Hortonworks HDP Hadoop with Livy and Spark 2.2.0. When I
run the same Spark job using spark-submit it succeeds, with all
transformations done.
When I try to do the spark-submit using Livy, the Spark job is
getting invoked and succeeds, but
But you can still use the Stanford NLP library and distribute it through Spark,
right?
On Sun, Nov 26, 2017 at 3:31 PM, Holden Karau wrote:
> So it’s certainly doable (it’s not super easy mind you), but until the
> arrow udf release goes out it will be rather slow.
>
> On Sun,
Can anybody reply on this?
On Tue, Nov 21, 2017 at 3:36 PM, Chetan Khatri <chetan.opensou...@gmail.com>
wrote:
>
> Hello Spark Users,
>
> I am getting below error, when i am trying to write dataset to parquet
> location. I have enough disk space available. Last time i w
Hello Spark Users,
I am getting the below error when I am trying to write a dataset to a parquet
location. I have enough disk space available. Last time I was facing the same
kind of error, which was resolved by increasing the number of cores in the hyper
parameters. Currently the result set data size is almost 400Gig
Process data in micro batch
On 18-Oct-2017 10:36 AM, "Chetan Khatri" <chetan.opensou...@gmail.com>
wrote:
> Your hard drive don't have much space
> On 18-Oct-2017 10:35 AM, "Mina Aslani" <aslanim...@gmail.com> wrote:
>
>> Hi,
>>
>&g
Your hard drive doesn't have much space
On 18-Oct-2017 10:35 AM, "Mina Aslani" wrote:
> Hi,
>
> I get "No space left on device" error in my spark worker:
>
> Error writing stream to file /usr/spark-2.2.0/work/app-.../0/stderr
> java.io.IOException: No space left on device
Use repartition
On 13-Oct-2017 9:35 AM, "KhajaAsmath Mohammed"
wrote:
> Hi,
>
> I am reading hive query and wiriting the data back into hive after doing
> some transformations.
>
> I have changed setting spark.sql.shuffle.partitions to 2000 and since then
> job completes
What you can do is create a partitioned column in Hive, for example date, and
use val finalDf = dataFrame.repartition(dataFrame.col("date-column")), and later say
insert overwrite table tablename partition(date-column) select * from tempview.
That would work as expected.
On 11-Aug-2017 11:03 PM, "KhajaAsmath Mohammed"
eed to increase it). Honestly most people
> find this number for their job "experimentally" (e.g. they try a few
> different things).
>
> On Wed, Aug 2, 2017 at 1:52 PM, Chetan Khatri <chetan.opensou...@gmail.com
> > wrote:
>
>> Ryan,
>> Thank y