Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-04-01 Thread Mich Talebzadeh
Good stuff Khalid. I have created a section in the Apache Spark Community Stack called spark-foundation (spark-foundation - Apache Spark Community - Slack). I invite you to add your web link to that section.

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-04-01 Thread Khalid Mammadov
Hey AN-TRUONG, I have got some articles about this subject that should help. E.g. https://khalidmammadov.github.io/spark/spark_internals_rdd.html Also check other Spark Internals posts on the web. Regards Khalid On Fri, 31 Mar 2023, 16:29 AN-TRUONG Tran Phan, wrote: > Thank you for your information, >

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-03-31 Thread Mich Talebzadeh
Yes, history refers to completed jobs; 4040 is for the running jobs. You should have screenshots for executors and stages as well. HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-03-31 Thread AN-TRUONG Tran Phan
Thank you for your information. I have tracked the Spark history server on port 18080 and the Spark UI on port 4040. I see the results of these two tools as similar, right? I want to know what each Task ID (for example Task ID 0, 1, 3, 4, 5, ) in the images does. Is that possible?

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-03-31 Thread Mich Talebzadeh
Are you familiar with the Spark GUI, by default on port 4040? Have a look. HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile

Re: Help needed regarding error with 5 node Spark cluster (shuffle error)- Comcast

2023-01-30 Thread Artemis User
Not sure where you get the property name "spark.memory.offHeap.use". The correct one should be "spark.memory.offHeap.enabled".  See https://spark.apache.org/docs/latest/configuration.html#spark-properties for details. On 1/30/23 10:12 AM, Jain, Sanchi wrote: I am not sure if this is the
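
For reference, a minimal sketch of the corrected settings (the 2g size is only an illustrative value; both properties must be set before the SparkContext is created):

    // enable off-heap memory with the correct property names
    val spark = org.apache.spark.sql.SparkSession.builder()
      .appName("offheap-example")
      .config("spark.memory.offHeap.enabled", "true")  // not spark.memory.offHeap.use
      .config("spark.memory.offHeap.size", "2g")       // required whenever off-heap is enabled
      .getOrCreate()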

Re: Help needed regarding error with 5 node Spark cluster (shuffle error)- Comcast

2023-01-30 Thread Mich Talebzadeh
Hi, Identify the cause of the shuffle. Also how are you using HDFS here? https://community.cloudera.com/t5/Support-Questions/Spark-Metadata-Fetch-Failed-Exception-Missing-an-output/td-p/203771 HTH view my Linkedin profile

Re: Help with Shuffle Read performance

2022-09-30 Thread Igor Calabria
Thanks a lot for the answers, folks. It turned out that Spark was just IOPS starved. Using better disks solved my issue, so nothing related to Kubernetes at all. Have a nice weekend everyone On Fri, Sep 30, 2022 at 4:27 PM Artemis User wrote: > The reduce phase is always more resource-intensive

Re: Help with Shuffle Read performance

2022-09-30 Thread Artemis User
The reduce phase is always more resource-intensive than the map phase. A couple of suggestions you may want to consider: 1. Setting the number of partitions to 18K may be way too high (the default is only 200). You may want to just use the default and the scheduler will
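
A minimal sketch of that suggestion, assuming a SparkSession named spark (200 is the default; any concrete value should be tuned to the cluster):

    // let a moderate value drive the reduce-side parallelism instead of 18K
    spark.conf.set("spark.sql.shuffle.partitions", "200")
    // for plain RDD jobs the analogous knob is spark.default.parallelism (set at context creation)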

Re: Help with Shuffle Read performance

2022-09-30 Thread Leszek Reimus
Hi Sungwoo, I tend to agree - for a new system, I would probably not go that route, as Spark on Kubernetes is getting there and can do a lot already. Issue I mentioned before can be fixed with proper node fencing - it is a typical stateful set problem Kubernetes has without fencing - node goes

Re: Help with Shuffle Read performance

2022-09-30 Thread Sungwoo Park
Hi Leszek, For running YARN on Kubernetes and then running Spark on YARN, is there a lot of overhead for maintaining YARN on Kubernetes? I thought people usually want to move from YARN to Kubernetes because of the overhead of maintaining Hadoop. Thanks, --- Sungwoo On Fri, Sep 30, 2022 at

Re: Help with Shuffle Read performance

2022-09-29 Thread Gourav Sengupta
Hi Leszek, spot on, therefore EMR being created and dynamically scaled up and down and being ephemeral proves that there is actually no advantage of using containers for large jobs. It is utterly pointless and I have attended interviews and workshops where no one has ever been able to prove its

Re: Help with Shuffle Read performance

2022-09-29 Thread Leszek Reimus
Hi Everyone, To add my 2 cents here: Advantage of containers, to me, is that it leaves the host system pristine and clean, allowing standardized devops deployment of hardware for any purpose. Way back before - when using bare metal / ansible, reusing hw always involved full reformat of base

Re: Help with Shuffle Read performance

2022-09-29 Thread Gourav Sengupta
Hi, don't containers finally run on systems, and isn't the only advantage of containers that you can do better utilisation of system resources by micro-managing the jobs running in them? Some say that containers have their own binaries which isolate the environment, but that is a lie, because in a

Re: Help with Shuffle Read performance

2022-09-29 Thread Igor Calabria
> What's the total number of Partitions that you have ? 18k > What machines are you using ? Are you using an SSD ? Using a family of r5.4xlarges nodes. Yes I'm using five GP3 Disks which gives me about 625 MB/s of sustained throughput (which is what I see when writing the shuffle data). > can

Re: Help with Shuffle Read performance

2022-09-29 Thread Vladimir Prus
Igor, what exact instance types do you use? Unless you use local instance storage and have actually configured your Kubernetes and Spark to use instance storage, your 30x30 exchange can run into EBS IOPS limits. You can investigate that by going to an instance, then to volume, and see monitoring

Re: Help with Shuffle Read performance

2022-09-29 Thread Tufan Rakshit
That's total nonsense, EMR is total crap, use Kubernetes, I will help you. Can you please provide the size of the shuffle file that is getting generated in each task? What's the total number of partitions that you have? What machines are you using? Are you using an SSD? Best Tufan

Re: Help with Shuffle Read performance

2022-09-29 Thread Gourav Sengupta
Hi, why not use EMR or data proc, kubernetes does not provide any benefit at all for such scale of work. It is a classical case of over engineering and over complication just for the heck of it. Also I think that in case you are in AWS, Redshift Spectrum or Athena for 90% of use cases are way

Re: Help With unstructured text file with spark scala

2022-02-25 Thread Danilo Sousa
Rafael Mendes, Are you from ? Thanks. > On 21 Feb 2022, at 15:33, Danilo Sousa wrote: > > Yes, this a only single file. > > Thanks Rafael Mendes. > >> On 13 Feb 2022, at 07:13, Rafael Mendes > > wrote: >> >> Hi, Danilo. >> Do you have a single large file,

Re: Help With unstructured text file with spark scala

2022-02-21 Thread Danilo Sousa
Yes, this is only a single file. Thanks Rafael Mendes. > On 13 Feb 2022, at 07:13, Rafael Mendes wrote: > > Hi, Danilo. > Do you have a single large file, only? > If so, I guess you can use tools like sed/awk to split it into more files > based on layout, so you can read these files into Spark.

Re: Help With unstructured text file with spark scala

2022-02-13 Thread Rafael Mendes
Hi, Danilo. Do you have a single large file, only? If so, I guess you can use tools like sed/awk to split it into more files based on layout, so you can read these files into Spark. On Wed, 9 Feb 2022 at 09:30, Bitfox wrote: > Hi > > I am not sure about the total situation. > But if you

Re: Help With unstructured text file with spark scala

2022-02-09 Thread Bitfox
Hi, I am not sure about the total situation. But if you want a Scala integration, I think you could use a regex to match and capture the keywords. Here I wrote one you can modify on your end. import scala.io.Source import scala.collection.mutable.ArrayBuffer val list1 =
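
A small self-contained sketch of that regex idea (the pattern and file name below are hypothetical placeholders, not the layout from the original file):

    import scala.io.Source

    // capture "key: value" style lines; anything that does not match the pattern is dropped
    val kv = """(\w+)\s*:\s*(.+)""".r
    val pairs = Source.fromFile("input.txt").getLines().collect {
      case kv(key, value) => (key, value)
    }.toList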

Re: Help With unstructured text file with spark scala

2022-02-09 Thread Danilo Sousa
Hello, how are you? Thanks for your time > Does the data contain records? Yes > Are the records "homogenous" ; ie; do they have the same fields? Yes the data is homogenous but have “two layouts” in the same file. > What is the format of the data? All data is string file .txt > Are records

Re: Help With unstructured text file with spark scala

2022-02-09 Thread Danilo Sousa
Hello, Yes, this block I can open as CSV with a # delimiter, but there is another block that is not in CSV format; that one is likely key-value. We have two different layouts in the same file. This is the “problem”. Thanks for your time. > Relação de Beneficiários Ativos e Excluídos > Carteira

Re: Help With unstructured text file with spark scala

2022-02-08 Thread Bitfox
Hello. You can treat it as a CSV file and load it from Spark: >>> df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").option("sep","#").load(csv_file) >>> df.show() (output truncated; columns include Plano, Código)

Re: Help With unstructured text file with spark scala

2022-02-08 Thread Lalwani, Jayesh
You will need to provide more info. Does the data contain records? Are the records "homogenous", i.e. do they have the same fields? What is the format of the data? Are records separated by lines/separators? Is the data sharded across multiple files? How big is each shard? On 2/8/22, 11:50

Re: help check my simple job

2022-02-06 Thread capitnfrakass
That did resolve my issue. Thanks a lot. frakass On 06/02/2022 17:25, Hannes Bibel wrote: Hi, looks like you're packaging your application for Scala 2.13 (should be specified in your build.sbt) while your Spark installation is built for Scala 2.12. Go to

Re: help check my simple job

2022-02-06 Thread Hannes Bibel
Hi, looks like you're packaging your application for Scala 2.13 (should be specified in your build.sbt) while your Spark installation is built for Scala 2.12. Go to https://spark.apache.org/downloads.html, select under "Choose a package type" the package type that says "Scala 2.13". With that
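
A minimal build.sbt sketch of that alignment (the version numbers are only illustrative; the point is that scalaVersion must match the Scala build of the Spark distribution you run against):

    scalaVersion := "2.12.15"  // matches a Spark distribution built for Scala 2.12
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.1" % "provided"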

Re: help on use case - spark parquet processing

2020-08-13 Thread Amit Sharma
Can you keep an Option field in your case class? Thanks Amit On Thu, Aug 13, 2020 at 12:47 PM manjay kumar wrote: > Hi , > > I have a use case, > > where i need to merge three data set and build one where ever data is > available. > > And my dataset is a complex object. > > Customer > - name -
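
A minimal sketch of the idea, with a deliberately simplified Customer shape (the real case class in the thread is more complex):

    case class Customer(name: Option[String] = None, address: Option[String] = None)

    // merge two partial records field by field, keeping whichever side has data
    def merge(a: Customer, b: Customer): Customer =
      Customer(a.name.orElse(b.name), a.address.orElse(b.address))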

Re: help understanding physical plan

2019-08-16 Thread Marcelo Valle
Thanks Tianlang. I saw the DAG on YARN, but what really solved my problem is adding intermediate steps and evaluating them eagerly to find out where the bottleneck was. My process now runs in 6 min. :D Thanks for the help. []s On Thu, 15 Aug 2019 at 07:25, Tianlang wrote: > Hi, > > Maybe you

Re: help understanding physical plan

2019-08-15 Thread Tianlang
Hi, Maybe you can look at the Spark UI; the physical plan has no timing information. On 2019/8/13 at 10:45 PM, Marcelo Valle wrote: Hi, I have a job running on AWS EMR. It's basically a join between 2 tables (parquet files on s3), one somehow large (around 50 gb) and other small (less

Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-12 Thread Reynold Xin
No sorry I'm not at liberty to share other people's code. On Fri, Jul 12, 2019 at 9:33 AM, Gourav Sengupta < gourav.sengu...@gmail.com > wrote: > > Hi Reynold, > > > I am genuinely curious about queries which are more than 1 MB and am > stunned by tens of MB's. Any samples to share :)  > >

Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-12 Thread Gourav Sengupta
Hi Reynold, I am genuinely curious about queries which are more than 1 MB and am stunned by tens of MB's. Any samples to share :) Regards, Gourav On Thu, Jul 11, 2019 at 5:03 PM Reynold Xin wrote: > There is no explicit limit but a JVM string cannot be bigger than 2G. It > will also at some

Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-11 Thread Reynold Xin
There is no explicit limit but a JVM string cannot be bigger than 2G. It will also at some point run out of memory with too big of a query plan tree or become incredibly slow due to query planning complexity. I've seen queries that are tens of MBs in size. On Thu, Jul 11, 2019 at 5:01 AM, 李书明

Re: [HELP WANTED] Apache Zipkin (incubating) needs Spark gurus

2019-03-21 Thread Reynold Xin
Are there specific questions you have? Might be easier to post them here also. On Wed, Mar 20, 2019 at 5:16 PM Andriy Redko wrote: > Hello Dear Spark Community! > > The hyper-popularity of the Apache Spark made it a de-facto choice for many > projects which need some sort of data processing

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-20 Thread Kazuaki Ishizaki
2018/06/21 01:29 Subject: Re: [Help] Codegen Stage grows beyond 64 KB Hi Kazuaki, It would be really difficult to produce a small S-A code to reproduce this problem because, I'm running through a big pipeline of feature engineering where I derive a lot of variables based on the pr

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-20 Thread Aakash Basu
oduce this problem? It would be very helpful that the > community will address this problem. > > Best regards, > Kazuaki Ishizaki > > > > From:vaquar khan > To:Eyal Zituny > Cc:Aakash Basu , user < > user@spark.apache.org> > Date:

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-20 Thread Kazuaki Ishizaki
that the community will address this problem. Best regards, Kazuaki Ishizaki From: vaquar khan To: Eyal Zituny Cc: Aakash Basu , user Date: 2018/06/18 01:57 Subject:Re: [Help] Codegen Stage grows beyond 64 KB Totally agreed with Eyal . The problem is that when Java

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-17 Thread vaquar khan
Totally agreed with Eyal. The problem is that when the Java programs generated by Catalyst from programs using DataFrame and Dataset are compiled into Java bytecode, the size of the bytecode of one method must not be 64 KB or more. This conflicts with the limitation of the Java class file, which is

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-17 Thread Eyal Zituny
Hi Akash, such errors might appear in large Spark pipelines; the root cause is a 64 KB JVM limitation. The reason your job isn't failing in the end is Spark's fallback: if codegen fails, the Spark compiler will try to create the flow without the codegen (less optimized). If you do not
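
When the fallback alone is not enough, one commonly suggested workaround (at the cost of the optimization) is to switch whole-stage code generation off for the affected job; a minimal sketch, assuming a SparkSession named spark:

    spark.conf.set("spark.sql.codegen.wholeStage", "false")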

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-16 Thread Aakash Basu
Hi, I already went through it, that's one use case. I've a complex and very big pipeline of multiple jobs under one spark session. Not getting, on how to solve this, as it is happening over Logistic Regression and Random Forest models, which I'm just using from Spark ML package rather than doing

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-16 Thread vaquar khan
Hi Akash, Please check stackoverflow. https://stackoverflow.com/questions/41098953/codegen-grows-beyond-64-kb-error-when-normalizing-large-pyspark-dataframe Regards, Vaquar khan On Sat, Jun 16, 2018 at 3:27 PM, Aakash Basu wrote: > Hi guys, > > I'm getting an error when I'm feature

Re: Help explaining explain() after DataFrame join reordering

2018-06-05 Thread Matteo Cossu
Hello, as explained here , the join order can be changed by the optimizer. The difference introduced in Spark 2.2 is that the reordering is based on statistics instead of heuristics, that can appear "random"

Re: help needed in perforance improvement of spark structured streaming

2018-05-30 Thread amit kumar singh
Hi team, any help with this? I have a use case where I need to call a stored procedure through Structured Streaming. I am able to send a Kafka message and call the stored procedure, but since the foreach sink keeps executing the stored procedure per message, I want to combine all the messages in a single

Re: help with streaming batch interval question needed

2018-05-25 Thread Peter Liu
Hi Jacek, This is exact what i'm looking for. Thanks!! Also thanks for the link. I just noticed that I can unfold the link of trigger and see the examples in java and scala languages - what a general help for a new comer :-)

Re: help with streaming batch interval question needed

2018-05-25 Thread Jacek Laskowski
Hi Peter, > Basically I need to find a way to set the batch-interval in (b), similar as in (a) below. That's trigger method on DataStreamWriter. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.streaming.DataStreamWriter import
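
A minimal sketch of that trigger call, assuming df is an existing streaming DataFrame (sink and interval are only illustrative):

    import org.apache.spark.sql.streaming.Trigger

    val query = df.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("10 seconds"))  // micro-batch interval for Structured Streaming
      .start()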

re: help with streaming batch interval question needed

2018-05-24 Thread Peter Liu
Hi there, from my apache spark streaming website (see links below), - the batch-interval is set when a spark StreamingContext is constructed (see example (a) quoted below) - the StreamingContext is available in older and new Spark version (v1.6, v2.2 to v2.3.0) (see

Re: help in copying data from one azure subscription to another azure subscription

2018-05-23 Thread Pushkar.Gujar
What are you using for storing data in those subscriptions? Datalake or Blobs? There is Azure Data Factory already available that can do copy between these cloud storage without having to go through spark Thank you, *Pushkar Gujar* On Mon, May 21, 2018 at 8:59 AM, amit kumar singh

Re: Help Required - Unable to run spark-submit on YARN client mode

2018-05-08 Thread Deepak Sharma
Can you try increasing the partitions for the base RDD/dataframe that you are working on? On Tue, May 8, 2018 at 5:05 PM, Debabrata Ghosh wrote: > Hi Everyone, > I have been trying to run spark-shell in YARN client mode, but am getting > a lot of ClosedChannelException
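
A minimal sketch of that suggestion, assuming df is the base DataFrame (400 is only an illustrative partition count):

    val repartitioned = df.repartition(400)  // raise parallelism before the shuffle-heavy stage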

Re: Help Required on Spark - Convert DataFrame to List with out using collect

2017-12-22 Thread Naresh Dulam
Hi Sunitha, Make the class that contains the common function you are calling serializable. Thank you, Naresh On Wed, Dec 20, 2017 at 9:58 PM Sunitha Chennareddy < chennareddysuni...@gmail.com> wrote: > Hi, > > Thank You All.. > > Here is my requirement, I have a dataframe which contains

Re: Help Required on Spark - Convert DataFrame to List with out using collect

2017-12-20 Thread Sunitha Chennareddy
Hi, Thank You All.. Here is my requirement, I have a dataframe which contains list of rows retrieved from oracle table. I need to iterate dataframe and fetch each record and call a common function by passing few parameters. Issue I am facing is : I am not able to call common function JavaRDD

Re: Help Required on Spark - Convert DataFrame to List with out using collect

2017-12-18 Thread Weichen Xu
Hi Sunitha, In the mapper function, you cannot update outer variables such as `personLst.add(person)`, this won't work so that's the reason you got an empty list. You can use `rdd.collect()` to get a local list of `Person` objects first, then you can safely iterate on the local list and do any
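
A small self-contained sketch of that pattern in a spark-shell style session (Person, commonFunction and the sample data are hypothetical stand-ins for the real class and interface call):

    import spark.implicits._

    case class Person(id: Long, name: String)
    def commonFunction(p: Person): Unit = println(p.name)  // stand-in for the real interface call

    val personDs = Seq(Person(1, "a"), Person(2, "b")).toDS()
    personDs.collect().foreach(commonFunction)  // collect() brings the rows to the driver; iterate locally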

Re: Help Required on Spark - Convert DataFrame to List with out using collect

2017-12-18 Thread Sunitha Chennareddy
Hi Jorn, In my case I have to call common interface function by passing the values of each rdd. So I have tried iterating , but I was not able to trigger common function from call method as commented in the snippet code in my earlier mail. Request you please share your views. Regards Sunitha

Re: Help Required on Spark - Convert DataFrame to List with out using collect

2017-12-18 Thread Jörn Franke
This is correct behavior. If you need to call another method, simply append another map, flatMap or whatever you need. Depending on your use case you may also use reduce and reduceByKey. However, you should never (!) use a global variable as in your snippet. This cannot work because you work in

Re: Help Required on Spark - Convert DataFrame to List with out using collect

2017-12-18 Thread Sunitha Chennareddy
Hi Deepak, I am able to map a row to the Person class; the issue is I want to call another method. I tried converting to a list and it's not working without using collect. Regards Sunitha On Tuesday, December 19, 2017, Deepak Sharma wrote: > I am not sure about java but in scala

Re: Help Required on Spark - Convert DataFrame to List with out using collect

2017-12-18 Thread Deepak Sharma
I am not sure about java but in scala it would be something like df.rdd.map{ x => MyClass(x.getString(0),.)} HTH --Deepak On Dec 19, 2017 09:25, "Sunitha Chennareddy" wrote: Hi All, I am new to Spark, I want to convert DataFrame to List with out using

Re: Help taking last value in each group (aggregates)

2017-08-28 Thread Everett Anderson
I'm still unclear on if orderBy/groupBy + aggregates is a viable approach or when one could rely on the last or first aggregate functions, but a working alternative is to use window functions with row_number and a filter kind of like this: import spark.implicits._ val reverseOrdering = Seq("a",
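
A minimal sketch of that window approach, assuming df has grouping columns "a", "b" and an ordering column "ts" (all names hypothetical):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    val w = Window.partitionBy("a", "b").orderBy(col("ts").desc)
    val latest = df.withColumn("rn", row_number().over(w))
      .filter(col("rn") === 1)
      .drop("rn")  // keeps the last row per (a, b) group according to "ts"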

Re: Help in Parsing 'Categorical' type of data

2017-06-23 Thread Yanbo Liang
Please consider using other classification models such as logistic regression or GBT. Naive Bayes usually treats features as counts, which is not suitable for features generated by a one-hot encoder. Thanks Yanbo On Wed, May 31, 2017 at 3:58 PM, Amlan Jyoti wrote:

Re: help/suggestions to setup spark cluster

2017-04-27 Thread Cody Koeninger
You can just cap the cores used per job. http://spark.apache.org/docs/latest/spark-standalone.html http://spark.apache.org/docs/latest/spark-standalone.html#resource-scheduling On Thu, Apr 27, 2017 at 1:07 AM, vincent gromakowski wrote: > Spark standalone is not
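
A minimal sketch of capping a single application on a standalone cluster (the number is only illustrative):

    val spark = org.apache.spark.sql.SparkSession.builder()
      .appName("capped-job")
      .config("spark.cores.max", "8")  // upper bound on cores this application may claim
      .getOrCreate()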

Re: help/suggestions to setup spark cluster

2017-04-27 Thread vincent gromakowski
Spark standalone is not multi-tenant; you need one cluster per job. Maybe you can try fair scheduling and use one cluster, but I doubt it will be prod ready... On 27 Apr 2017 at 5:28 AM, "anna stax" wrote: > Thanks Cody, > > As I already mentioned I am running spark

Re: help/suggestions to setup spark cluster

2017-04-26 Thread anna stax
Thanks Cody, As I already mentioned I am running spark streaming on EC2 cluster in standalone mode. Now in addition to streaming, I want to be able to run spark batch job hourly and adhoc queries using Zeppelin. Can you please confirm that a standalone cluster is OK for this. Please provide me

Re: help/suggestions to setup spark cluster

2017-04-26 Thread Cody Koeninger
The standalone cluster manager is fine for production. Don't use Yarn or Mesos unless you already have another need for it. On Wed, Apr 26, 2017 at 4:53 PM, anna stax wrote: > Hi Sam, > > Thank you for the reply. > > What do you mean by > I doubt people run spark in a.

Re: help/suggestions to setup spark cluster

2017-04-26 Thread anna stax
Hi Sam, Thank you for the reply. What do you mean by I doubt people run spark in a. Single EC2 instance, certainly not in production I don't think What is wrong in having a data pipeline on EC2 that reads data from kafka, processes using spark and outputs to cassandra? Please explain. Thanks

Re: help/suggestions to setup spark cluster

2017-04-26 Thread Sam Elamin
Hi Anna There are a variety of options for launching spark clusters. I doubt people run spark in a. Single EC2 instance, certainly not in production I don't think I don't have enough information of what you are trying to do but if you are just trying to set things up from scratch then I think

Re: help!!!----issue with spark-sql type cast form long to longwritable

2017-01-30 Thread Alex
Hi All, If I modify the code to the below, the Hive UDF works in spark-sql but it gives different results. Please let me know the difference between the two codes below. 1) public Object get(Object name) { int pos = getPos((String)name); if(pos<0) return null;

Re: help!!!----issue with spark-sql type cast form long to longwritable

2017-01-30 Thread Alex
How to debug Hive UDFs?! On Jan 24, 2017 5:29 PM, "Sirisha Cheruvu" wrote: > Hi Team, > > I am trying to keep below code in get method and calling that get method in > another hive UDF > and running the hive UDF using Hive Context.sql procedure.. > > > switch (f) { > case

Re: help!!!----issue with spark-sql type cast form long to longwritable

2017-01-30 Thread Alex
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error: java.lang.Double cannot be cast to org.apache.hadoop.hive.serde2.io.DoubleWritable] Getting below error while running hive UDF on spark but the UDF is working perfectly fine in Hive.. public Object get(Object name) {

Re: help!!!----issue with spark-sql type cast form long to longwritable

2017-01-24 Thread Takeshi Yamamuro
Hi, Could you show us the whole code to reproduce that? // maropu On Wed, Jan 25, 2017 at 12:02 AM, Deepak Sharma wrote: > Can you try writing the UDF directly in spark and register it with spark > sql or hive context ? > Or do you want to reuse the existing UDF jar for

Re: help!!!----issue with spark-sql type cast form long to longwritable

2017-01-24 Thread Deepak Sharma
Can you try writing the UDF directly in spark and register it with spark sql or hive context ? Or do you want to reuse the existing UDF jar for hive in spark ? Thanks Deepak On Jan 24, 2017 5:29 PM, "Sirisha Cheruvu" wrote: > Hi Team, > > I am trying to keep below code in

Re: Help in generating unique Id in spark row

2017-01-05 Thread Olivier Girardot
There is a way: you can use org.apache.spark.sql.functions.monotonicallyIncreasingId; it will give each row of your dataframe a unique id On Tue, Oct 18, 2016 10:36 AM, ayan guha guha.a...@gmail.com wrote: Do you have any primary key or unique identifier in your data? Even if multiple
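
A minimal sketch of that function (named monotonically_increasing_id in recent Spark versions), assuming df is an existing DataFrame; note the ids are unique and increasing but not consecutive, since they encode the partition id:

    import org.apache.spark.sql.functions.monotonically_increasing_id

    val withId = df.withColumn("id", monotonically_increasing_id())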

RE: Help needed in parsing JSon with nested structures

2016-10-31 Thread Jan Botorek
Hello, From my point of view, it would be more efficient and probably more "readable" if you just extracted the required data using some JSON parsing library (GSON, Jackson), constructed some global object (or pre-processed the data), and then began with the Spark operations. Jan From:

Re: Help in generating unique Id in spark row

2016-10-18 Thread ayan guha
Do you have any primary key or unique identifier in your data? Even if multiple columns can make a composite key? In other words, can your data have exactly the same 2 rows with different unique IDs? Also, do you have to have a numeric ID? You may want to pursue a hashing algorithm such as SHA to
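
A minimal sketch of that hashing idea, assuming df has composite-key columns col1 and col2 (the column names are hypothetical):

    import org.apache.spark.sql.functions.{col, concat_ws, sha2}

    // deterministic surrogate key derived from the composite key
    val withKey = df.withColumn("row_key", sha2(concat_ws("||", col("col1"), col("col2")), 256))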

Re: Help in generating unique Id in spark row

2016-10-17 Thread Saurav Sinha
Can anyone help me out? On Mon, Oct 17, 2016 at 7:27 PM, Saurav Sinha wrote: > Hi, > > I am in a situation where I want to generate a unique Id for each row. > > I have used monotonicallyIncreasingId but it is giving increasing values > and starts generating from the start if it

Re: Help with Jupyter Notebook Settup on CDH using Anaconda

2016-09-03 Thread Marco Mistroni
Hi please paste the exception for Spark vs Jupyter, you might want to sign up for this. It'll give you jupyter and spark...and presumably the spark-csv is already part of it ? https://community.cloud.databricks.com/login.html hth marco On Sat, Sep 3, 2016 at 8:10 PM, Arif,Mubaraka

Re: Help testing the Spark Extensions for the Apache Bahir 2.0.0 release

2016-08-07 Thread Luciano Resende
Simple, just help us test the available extensions using Spark 2.0.0... preferable in real workloads that you might be using in your day to day usage of Spark. I wrote a quick getting started for using the new MQTT Structured Streaming on my blog, which can serve as an example:

Re: Help testing the Spark Extensions for the Apache Bahir 2.0.0 release

2016-08-07 Thread Sivakumaran S
Hi, How can I help? regards, Sivakumaran S > On 06-Aug-2016, at 6:18 PM, Luciano Resende wrote: > > Apache Bahir is voting it's 2.0.0 release based on Apache Spark 2.0.0. > > https://www.mail-archive.com/dev@bahir.apache.org/msg00312.html >

Re: [HELP:]Save Spark Dataframe in Phoenix Table

2016-04-08 Thread Josh Mahonin
Hi Divya, That's strange. Are you able to post a snippet of your code to look at? And are you sure that you're saving the dataframes as per the docs ( https://phoenix.apache.org/phoenix_spark.html)? Depending on your HDP version, it may or may not actually have phoenix-spark support.

Re: help coercing types

2016-03-19 Thread Jacek Laskowski
Hi, Just a side question: why do you convert DataFrame to RDD? It's like driving backwards (possible but ineffective and dangerous at times). P.S. I'd even go for Dataset. Jacek 18.03.2016 5:20 PM "Bauer, Robert" wrote: > I have data that I pull in using a sql

Re: [Help]: DataframeNAfunction fill method throwing exception

2016-03-01 Thread ai he
Hi Divya, I guess the error is thrown from spark-csv. Spark-csv tries to parse string "null" to double. The workaround is to add nullValue option, like .option("nullValue", "null"). But this nullValue feature is not included in current spark-csv 1.3. Just checkout the master of spark-csv and use
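
For reference, a minimal sketch of how that option is passed (path and other options illustrative; as noted above, older spark-csv releases may not apply it to non-string columns):

    val df = spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("nullValue", "null")  // treat the literal string "null" as SQL NULL
      .load("data.csv")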

Re: [Help]: DataframeNAfunction fill method throwing exception

2016-02-25 Thread Divya Gehlot
Hi Jan, Thanks for the help. Alas, your suggestion also didn't work: scala> import org.apache.spark.sql.types.{StructType, StructField, > StringType,IntegerType,LongType,DoubleType, FloatType}; > import org.apache.spark.sql.types.{StructType, StructField, StringType, > IntegerType, LongType,

Re: [Help]: DataframeNAfunction fill method throwing exception

2016-02-25 Thread Jan Štěrba
just use coalesce function df.selectExpr("name", "coalesce(age, 0) as age") -- Jan Sterba https://twitter.com/honzasterba | http://flickr.com/honzasterba | http://500px.com/honzasterba On Fri, Feb 26, 2016 at 5:27 AM, Divya Gehlot wrote: > Hi, > I have dataset which

Re: Help needed in deleting a message posted in Spark User List

2016-02-06 Thread Corey Nolet
The whole purpose of Apache mailing lists is that the messages get indexed all over the web so that discussions and questions/solutions can be searched easily by google and other engines. For this reason, and the messages being sent via email as Steve pointed out, it's just not possible to

Re: Help needed in deleting a message posted in Spark User List

2016-02-06 Thread Steve Loughran
> On 5 Feb 2016, at 17:35, Marcelo Vanzin wrote: > > You don't... just send a new one. > > On Fri, Feb 5, 2016 at 9:33 AM, swetha kasireddy > wrote: >> Hi, >> >> I want to edit/delete a message posted in Spark User List. How do I do that? >>

Re: Help needed in deleting a message posted in Spark User List

2016-02-05 Thread Marcelo Vanzin
You don't... just send a new one. On Fri, Feb 5, 2016 at 9:33 AM, swetha kasireddy wrote: > Hi, > > I want to edit/delete a message posted in Spark User List. How do I do that? > > Thanks! -- Marcelo

Re: Help me! Spark WebUI is corrupted!

2015-12-31 Thread Aniket Bhatnagar
Are you running on YARN or standalone? On Thu, Dec 31, 2015, 3:35 PM LinChen wrote: > *Screenshot1(Normal WebUI)* > > > > *Screenshot2(Corrupted WebUI)* > > > > As screenshot2 shows, the format of my Spark WebUI looks strange and I > cannot click the description of active

Re: Help: Driver OOM when shuffle large amount of data

2015-12-28 Thread Eugene Morozov
Kendal, have you tried to reduce number of partitions? -- Be well! Jean Morozov On Mon, Dec 28, 2015 at 9:02 AM, kendal wrote: > My driver is running OOM with my 4T data set... I don't collect any data to > driver. All what the program done is map - reduce - saveAsTextFile.

Re: Help: Driver OOM when shuffle large amount of data

2015-12-28 Thread Chris Fregly
Which version of Spark is this? Is there any chance that a single key, or set of keys, has a large number of values relative to the other keys (aka skew)? If so, Spark 1.5 *should* fix this issue with the new Tungsten stuff, although I still had some issues with 1.5.1 in a similar

Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-14 Thread Jakob Odersky
> Sorry, I'm late. I try again and again; now I use Spark 1.4.0, Hadoop 2.4.1, but I also find something strange like this: > http://apache-spark-user-list.1001560.n3.nabble.com/worker-java-lang-ClassNotFoundException-ttt-test-anonfun-1-td25696.html > (if I use "textFile", it can't run.) In the

Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-14 Thread Jakob Odersky
sorry typo, I meant *without* the addJar On 14 December 2015 at 11:13, Jakob Odersky wrote: > > Sorry,I'm late.I try again and again ,now I use spark 1.4.0 ,hadoop > 2.4.1.but I also find something strange like this : > > > >

Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-11 Thread Jakob Odersky
It looks like you have an issue with your classpath, I think it is because you add a jar containing Spark twice: first, you have a dependency on Spark somewhere in your build tool (this allows you to compile and run your application), second you re-add Spark here >

Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-11 Thread Jakob Odersky
Btw, Spark 1.5 comes with support for hadoop 2.2 by default On 11 December 2015 at 03:08, Bonsen wrote: > Thank you,and I find the problem is my package is test,but I write package > org.apache.spark.examples ,and IDEA had imported the > spark-examples-1.5.2-hadoop2.6.0.jar

Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-10 Thread Jakob Odersky
Could you provide some more context? What is rawData? On 10 December 2015 at 06:38, Bonsen wrote: > I do like this "val secondData = rawData.flatMap(_.split("\t").take(3))" > > and I find: > 15/12/10 22:36:55 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, >

Re: Help: Get Timeout error and FileNotFoundException when shuffling large files

2015-12-10 Thread Sudhanshu Janghel
Can you please paste the stack trace? Sudhanshu

Re: Help: Get Timeout error and FileNotFoundException when shuffling large files

2015-12-10 Thread manasdebashiskar
Is that the only kind of error you are getting? Is it possible something else dies that gets buried in other messages? Try repairing HDFS (fsck etc.) to find out if blocks are intact. A few things to check: 1) whether you have too many small files; 2) whether your system is complaining about too many inodes etc.; 3)

Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-10 Thread Bonsen
I do it like this: "val secondData = rawData.flatMap(_.split("\t").take(3))" and I find: 15/12/10 22:36:55 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 219.216.65.129): java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer at

Re: Help with type check

2015-12-01 Thread Eyal Sharon
Great, that works perfectly!! Also thanks for the links - very helpful On Tue, Dec 1, 2015 at 12:13 AM, Jakob Odersky wrote: > Hi Eyal, > > what you're seeing is not a Spark issue, it is related to boxed types. > > I assume 'b' in your code is some kind of java buffer, where

Re: Help with type check

2015-11-30 Thread Jakob Odersky
Hi Eyal, what you're seeing is not a Spark issue, it is related to boxed types. I assume 'b' in your code is some kind of java buffer, where b.getDouble() returns an instance of java.lang.Double and not a scala.Double. Hence muCouch is an Array[java.lang.Double], an array containing boxed
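
A minimal standalone sketch of that boxing issue and the explicit unboxing fix:

    // java.lang.Double (boxed) is not the same type as scala.Double (primitive)
    val boxed: Array[java.lang.Double] = Array(java.lang.Double.valueOf(1.5), java.lang.Double.valueOf(2.5))
    val unboxed: Array[Double] = boxed.map(_.doubleValue())  // explicit unboxing avoids the mismatch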

Re: Help with Couchbase connector error

2015-11-29 Thread Eyal Sharon
Thanks guys , that was very helpful On Thu, Nov 26, 2015 at 10:29 PM, Shixiong Zhu wrote: > Het Eyal, I just checked the couchbase spark connector jar. The target > version of some of classes are Java 8 (52.0). You can create a ticket in >

Re: Help with Couchbase connector error

2015-11-26 Thread Shixiong Zhu
Hey Eyal, I just checked the couchbase spark connector jar. The target version of some of the classes is Java 8 (52.0). You can create a ticket in https://issues.couchbase.com/projects/SPARKC Best Regards, Shixiong Zhu 2015-11-26 9:03 GMT-08:00 Ted Yu : > StoreMode is from

Re: Help with Couchbase connector error

2015-11-26 Thread Ted Yu
This implies version mismatch between the JDK used to build your jar and the one at runtime. When building, target JDK 1.7 There're plenty of posts on the web for dealing with such error. Cheers On Thu, Nov 26, 2015 at 7:31 AM, Eyal Sharon wrote: > Hi, > > I am trying to
