Re: External Spark shuffle service for k8s

2024-04-10 Thread Arun Ravi
Hi Everyone, I had explored IBM's and AWS's S3 shuffle plugins (some time back), and I had also explored AWS FSx Lustre in a few of my production jobs which have ~20TB of shuffle operations with 200-300 executors. What I have observed is that S3 and FSx behaviour was fine during the write phase, however I

Re: custom rdd - do I need a hadoop input format?

2019-09-17 Thread Arun Mahadevan
You can do it with a custom RDD implementation. You will mainly implement "getPartitions" - the logic to split your input into partitions - and "compute" to compute and return the values from the executors. On Tue, 17 Sep 2019 at 08:47, Marcelo Valle wrote: > Just to be more clear about my
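For illustration, a minimal Scala sketch of the pattern described above; the class and field names (LinesRDD, LinesPartition) are hypothetical, not from the thread:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{Partition, SparkContext, TaskContext}

    // Hypothetical partition type carrying the slice of input it covers
    case class LinesPartition(index: Int, lines: Seq[String]) extends Partition

    // Minimal custom RDD: getPartitions splits the input, compute returns the values
    class LinesRDD(sc: SparkContext, input: Seq[String], numSlices: Int)
      extends RDD[String](sc, Nil) {

      override protected def getPartitions: Array[Partition] =
        input.grouped(math.max(1, input.size / numSlices)).zipWithIndex
          .map { case (chunk, i) => LinesPartition(i, chunk) }.toArray

      override def compute(split: Partition, context: TaskContext): Iterator[String] =
        split.asInstanceOf[LinesPartition].lines.iterator
    }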

GC problem doing fuzzy join

2019-06-18 Thread Arun Luthra
operation, and each record in the RDD takes the broadcasted table and FILTERS it. There appears to be large GC happening, so I suspect that huge repeated data deletion of copies of the broadcast table is causing GC. Is there a way to fix this pattern? Thanks, Arun

Re: how to get spark-sql lineage

2019-05-16 Thread Arun Mahadevan
You can check out https://github.com/hortonworks-spark/spark-atlas-connector/ On Wed, 15 May 2019 at 19:44, lk_spark wrote: > hi,all: > When I use spark , if I run some SQL to do ETL how can I get > lineage info. I found that , CDH spark have some config about lineage : >

Re: JvmPauseMonitor

2019-04-15 Thread Arun Mahadevan
Spark TaskMetrics[1] has a "jvmGCTime" metric that captures the amount of time spent in GC. This is also available via the listener I guess. Thanks, Arun [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala#L89 On Mon, 1
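As a rough illustration (a sketch, not code from the thread), that metric can be read from a listener like this:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Logs the per-task GC time reported in TaskMetrics
    class GcTimeListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val metrics = taskEnd.taskMetrics
        if (metrics != null) {
          println(s"Task ${taskEnd.taskInfo.taskId} spent ${metrics.jvmGCTime} ms in GC")
        }
      }
    }

    // register with: sc.addSparkListener(new GcTimeListener)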

Creating Hive Persistent view using Spark Sql defaults to Sequence File Format

2019-03-19 Thread arun rajesh
Hi All, I am using spark 2.2 in an EMR cluster. I have a hive table in ORC format and I need to create a persistent view on top of this hive table. I am using spark sql to create the view. By default spark sql creates the view with LazySerde. How can I change the inputformat to use ORC? PFA

Re: Structured Streaming & Query Planning

2019-03-18 Thread Arun Mahadevan
. The tiny micro-batch use cases should ideally be solved using continuous mode (once it matures) which would not have this overhead. Thanks, Arun On Mon, 18 Mar 2019 at 00:39, Jungtaek Lim wrote: > Almost everything is coupled with logical plan right now, including > updated range for source

Re: use rocksdb for spark structured streaming (SSS)

2019-03-10 Thread Arun Mahadevan
Read the link carefully; this solution is available (*only*) in Databricks Runtime. You can enable RocksDB-based state management by setting the following configuration in the SparkSession before starting the streaming query. spark.conf.set( "spark.sql.streaming.stateStore.providerClass",
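(As a side note, not part of this thread: open-source Spark 3.2+ later shipped its own RocksDB provider, so the equivalent setting there would look roughly like the sketch below.)

    // Hedged sketch for open-source Spark 3.2+, not Databricks Runtime
    spark.conf.set(
      "spark.sql.streaming.stateStore.providerClass",
      "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")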

Re: Question about RDD pipe

2019-01-17 Thread Arun Mahadevan
Yes, the script should be present on all the executor nodes. You can pass your script via spark-submit (e.g. --files script.sh) and then you should be able to refer that (e.g. "./script.sh") in rdd.pipe. - Arun On Thu, 17 Jan 2019 at 14:18, Mkal wrote: > Hi, im trying to ru
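A minimal sketch of that pattern, assuming an existing RDD[String] named rdd and a hypothetical script.sh that reads stdin and writes stdout:

    // spark-submit --files script.sh ...
    val piped = rdd.pipe("./script.sh")   // the script runs on each executor
    piped.take(10).foreach(println)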

Re: Equivalent of emptyDataFrame in StructuredStreaming

2018-11-17 Thread Arun Manivannan
f I haven't done a good job in explaining it well. Cheers, Arun On Tue, Nov 6, 2018 at 7:34 AM Jungtaek Lim wrote: > Could you explain what you're trying to do? It should have no batch for no > data in stream, so it will end up to no-op even it is possible. > > - Jungtaek Lim (Hea

Equivalent of emptyDataFrame in StructuredStreaming

2018-11-05 Thread Arun Manivannan
ate it but I am converting it to DS immediately. So, I am leaning towards this at the moment. * val emptyErrorStream = (spark:SparkSession) => { implicit val sqlC = spark.sqlContext MemoryStream[DataError].toDS() } Cheers, Arun

Re: Error - Dropping SparkListenerEvent because no remaining room in event queue

2018-10-24 Thread Arun Mahadevan
Maybe you have spark listeners that are not processing the events fast enough? Do you have spark event logging enabled? You might have to profile the built-in and your custom listeners to see what's going on. - Arun On Wed, 24 Oct 2018 at 16:08, karan alang wrote: > > Pls note - Spark v

Re: Kafka backlog - spark structured streaming

2018-07-30 Thread Arun Mahadevan
Here's a proposal to add - https://github.com/apache/spark/pull/21819 It's always good to set "maxOffsetsPerTrigger" unless you want spark to process till the end of the stream in each micro batch. Even without "maxOffsetsPerTrigger" the lag can be non-zero by the time the micro batch completes.
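For illustration, a hedged sketch of setting that option on the Kafka source (broker and topic names are hypothetical; assumes the spark-sql-kafka package is on the classpath):

    // Cap how many Kafka offsets each micro-batch reads
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "events")
      .option("maxOffsetsPerTrigger", "10000")
      .load()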

Re: Question of spark streaming

2018-07-27 Thread Arun Mahadevan
. Thanks, Arun From: utkarsh rathor Date: Friday, July 27, 2018 at 5:15 AM To: "user@spark.apache.org" Subject: Question of spark streaming I am following the book Spark the Definitive Guide The following code is executed locally using spark-shell Procedure: Started the s

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-12 Thread Arun Mahadevan
close(null)" is invoked. You can batch your writes in the process and/or in the close. The guess the writes can still be atomic and decided by if “close” returns successfully or throws an exception. Thanks, Arun From: chandan prakash Date: Thursday, July 12, 2018 at 10:37 AM To: Aru

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-12 Thread Arun Mahadevan
Yes ForeachWriter [1] could be an option If you want to write to different sinks. You can put your custom logic to split the data into different sinks. The drawback here is that you cannot plugin existing sinks like Kafka and you need to write the custom logic yourself and you cannot scale the

Re: Unable to alter partition. The transaction for alter partition did not commit successfully.

2018-07-10 Thread Arun Hive
details o what are you doing  On Wed, May 30, 2018 at 12:58 PM Arun Hive wrote: Hi  While running my spark job component i am getting the following exception. Requesting for your help on this:Spark core version - spark-core_2.10-2.1.1 Spark streaming version -spark-streaming_2.10-2.1.1 Spark hive

Re: Unable to alter partition. The transaction for alter partition did not commit successfully.

2018-05-30 Thread Arun Hive
) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39) Regards,Arun On Tuesday, May 29, 2018, 1:22:17 PM PDT, Arun Hive wrote: Hi  While running my spark job component i am getting

Closing IPC connection

2018-05-30 Thread Arun Hive
tion(Client.java:608) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:706) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:369) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1522) at org.apache.hadoop.ipc.Client.call(Client.java:1439) ... 77 more Regards,Arun

Re: question on collect_list or say aggregations in general in structured streaming 2.3.0

2018-05-03 Thread Arun Mahadevan
I think you need to group by a window (tumbling) and define watermarks (put a very low watermark or even 0) to discard the state. Here the window duration becomes your logical batch. - Arun From: kant kodali <kanth...@gmail.com> Date: Thursday, May 3, 2018 at 1:52 AM To: "user @s
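A rough sketch of that suggestion, assuming a streaming DataFrame named events with eventTime, key and value columns (names are hypothetical) and spark.implicits._ imported:

    import org.apache.spark.sql.functions.{window, collect_list}

    // The tumbling window acts as the logical batch; a tiny watermark lets old state be dropped
    val agg = events
      .withWatermark("eventTime", "0 seconds")
      .groupBy(window($"eventTime", "1 minute"), $"key")
      .agg(collect_list($"value").as("values"))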

Re: [Structured Streaming] Restarting streaming query on exception/termination

2018-04-24 Thread Arun Mahadevan
: StreamingQueryException => // log it } } Thanks, Arun From: Priyank Shrivastava <priya...@gmail.com> Date: Monday, April 23, 2018 at 11:27 AM To: formice <51296...@qq.com>, "user@spark.apache.org" <user@spark.apache.org> Subject: Re: [Structured Streaming] Restartin
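The surrounding restart pattern being discussed looks roughly like this (a sketch; startQuery is a hypothetical function that builds and starts the streaming query):

    import org.apache.spark.sql.streaming.StreamingQueryException

    var keepRunning = true
    while (keepRunning) {
      val query = startQuery()
      try {
        query.awaitTermination()
        keepRunning = false   // clean stop
      } catch {
        case e: StreamingQueryException =>
          // log it, then loop to restart the query
          println(s"Query failed, restarting: ${e.getMessage}")
      }
    }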

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread Arun Mahadevan
I assume its going to compare by the first column and if equal compare the second column and so on. From: kant kodali <kanth...@gmail.com> Date: Wednesday, April 18, 2018 at 6:26 PM To: Jungtaek Lim <kabh...@gmail.com> Cc: Arun Iyer <ar...@apache.org>, Mich

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread Arun Mahadevan
The below expr might work: df.groupBy($"id").agg(max(struct($"amount", $"my_timestamp")).as("data")).select($"id", $"data.*") Thanks, Arun From: Jungtaek Lim <kabh...@gmail.com> Date: Wednesday, April 18, 2018 at 4:54 PM To:

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread Arun Mahadevan
erations is not there yet. Thanks, Arun From: kant kodali <kanth...@gmail.com> Date: Tuesday, April 17, 2018 at 11:41 AM To: Tathagata Das <tathagata.das1...@gmail.com> Cc: "user @spark" <user@spark.apache.org> Subject: Re: can we use mapGroupsWithState in raw sql?

unsubscribe

2018-02-25 Thread Arun Khetarpal

Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread Arun Rai
Or you can try mounting that drive to all nodes. On Fri, Sep 29, 2017 at 6:14 AM Jörn Franke wrote: > You should use a distributed filesystem such as HDFS. If you want to use > the local filesystem then you have to copy each file to each node. > > > On 29. Sep 2017, at

Re: [SPARK-SQL] Does spark-sql have Authorization built in?

2017-09-18 Thread Arun Khetarpal
Ping. I did some digging around in the code base - I see that this is not present currently. Just looking for an acknowledgement Regards, Arun > On 15-Sep-2017, at 8:43 PM, Arun Khetarpal <arunkhetarpa...@gmail.com> wrote: > > Hi - > > Wanted to understand i

[SPARK-SQL] Does spark-sql have Authorization built in?

2017-09-15 Thread Arun Khetarpal
Hi - Wanted to understand if spark sql has GRANT and REVOKE statements available? Is anyone working on making that available? Regards, Arun - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

RowMatrix: tallSkinnyQR

2017-06-09 Thread Arun
hi def tallSkinnyQR(computeQ: Boolean = false): QRDecomposition[RowMatrix, Matrix] In the output of this method Q is a distributed matrix and R is a local Matrix. What's the reason R is a local Matrix? -Arun

RMSE recommender system

2017-05-20 Thread Arun
hi all.. I am new to machine learning. I am working on a recommender system. For the training dataset RMSE is 0.08 while on test data it is 2.345. What's the conclusion and what steps can I take to improve? Sent from Samsung tablet

spark ML Recommender program

2017-05-17 Thread Arun
hi I am writing a spark ML Movie Recommender program in IntelliJ on Windows 10. The dataset is 2MB with 10 datapoints, and my laptop has 8gb memory. When I set the number of iterations to 10 it works fine; when I set the number of iterations to 20 I get a StackOverflow error.. What's the solution?.. thanks Sent from

Re: My spark job runs faster in spark 1.6 and much slower in spark 2.0

2017-02-14 Thread arun kumar Natva
ot sure why they run >> painfully long in spark 2.0. >> >> I am using spark 1.6 & spark 2.0 on HDP 2.5.3 >> >> >> >> >> -- >> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/My-spark-job-runs-faster-in-spark-1-6-and-much-slower-in-spark-2-0-tp28390.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> - >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> >> > -- Regards, Arun Kumar Natva

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-16 Thread Arun Patel
for this issue? I tried playing with spark.memory.fraction and spark.memory.storageFraction. But, it did not help. Appreciate your help on this!!! On Tue, Nov 15, 2016 at 8:44 PM, Arun Patel <arunp.bigd...@gmail.com> wrote: > Thanks for the quick response. > > Its a single XML file and I

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Arun Patel
with new version and try to use different rowTags and increase executor-memory tomorrow. I will open a new issue as well. On Tue, Nov 15, 2016 at 7:52 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote: > Hi Arun, > > > I have few questions. > > Dose your XML file have like few huge

Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Arun Patel
I am trying to read an XML file which is 1GB is size. I am getting an error 'java.lang.OutOfMemoryError: Requested array size exceeds VM limit' after reading 7 partitions in local mode. In Yarn mode, it throws 'java.lang.OutOfMemoryError: Java heap space' error after reading 3 partitions. Any

Spark XML ignore namespaces

2016-11-03 Thread Arun Patel
I see that 'ignoring namespaces' issue is resolved. https://github.com/databricks/spark-xml/pull/75 How do we enable this option and ignore namespace prefixes? - Arun

Re: Check if a nested column exists in DataFrame

2016-09-13 Thread Arun Patel
at 5:28 PM, Arun Patel <arunp.bigd...@gmail.com> wrote: > I'm trying to analyze XML documents using spark-xml package. Since all > XML columns are optional, some columns may or may not exist. When I > register the Dataframe as a table, how do I check if a nested column is > e

Check if a nested column exists in DataFrame

2016-09-12 Thread Arun Patel
I'm trying to analyze XML documents using spark-xml package. Since all XML columns are optional, some columns may or may not exist. When I register the Dataframe as a table, how do I check if a nested column is existing or not? My column name is "emp" which is already exploded and I am trying to
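For illustration, two hedged ways to test for an optional column before using it, assuming df is the DataFrame loaded with spark-xml ("name" is a hypothetical nested field):

    import scala.util.Try
    import org.apache.spark.sql.functions.col

    val hasEmp = df.schema.fieldNames.contains("emp")            // top-level column check
    val hasEmpName = Try(df.select(col("emp.name"))).isSuccess   // nested field check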

Re: spark-xml to avro - SchemaParseException: Can't redefine

2016-09-09 Thread Arun Patel
; > github.com > sixers changed the title from Save DF with nested records with the same > name to spark-avro fails to save DF with nested records having the same > name Jun 23, 2015 > > > > -- > *From:* Arun Patel <arunp.bigd...@gmail.com> &g

spark-xml to avro - SchemaParseException: Can't redefine

2016-09-08 Thread Arun Patel
I'm trying to convert XML to AVRO. But, I am getting a SchemaParser exception for 'Rules', which exists in two separate containers. Any thoughts? XML is attached. df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='GGLResponse',attributePrefix='').load('GGL.xml')

Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-24 Thread Arun Luthra
Also for the record, turning on kryo was not able to help. On Tue, Aug 23, 2016 at 12:58 PM, Arun Luthra <arun.lut...@gmail.com> wrote: > Splitting up the Maps to separate objects did not help. > > However, I was able to work around the problem by reimplementing it with > RDD

Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-23 Thread Arun Luthra
Splitting up the Maps to separate objects did not help. However, I was able to work around the problem by reimplementing it with RDD joins. On Aug 18, 2016 5:16 PM, "Arun Luthra" <arun.lut...@gmail.com> wrote: > This might be caused by a few large Map objects that Spark is tr

Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-18 Thread Arun Luthra
me? What if I manually split them up into numerous Map variables? On Mon, Aug 15, 2016 at 2:12 PM, Arun Luthra <arun.lut...@gmail.com> wrote: > I got this OOM error in Spark local mode. The error seems to have been at > the start of a stage (all of the stages on the UI showed

Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-15 Thread Arun Luthra
I got this OOM error in Spark local mode. The error seems to have been at the start of a stage (all of the stages on the UI showed as complete, there were more stages to do but had not showed up on the UI yet). There appears to be ~100G of free memory at the time of the error. Spark 2.0.0 200G

Re: groupByKey() compile error after upgrading from 1.6.2 to 2.0.0

2016-08-10 Thread Arun Luthra
gt; RDD but now returns another dataset or an unexpected implicit conversion. > Just add rdd() before the groupByKey call to push it into an RDD. That > being said - groupByKey generally is an anti-pattern so please be careful > with it. > > On Wed, Aug 10, 2016 at 8:07 PM, Arun
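A minimal sketch of that suggestion, assuming pairs is a Dataset[(String, Int)] built from the original data:

    val grouped = pairs.rdd    // drop back to RDD[(String, Int)]
      .groupByKey()            // PairRDDFunctions.groupByKey, as in 1.6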

groupByKey() compile error after upgrading from 1.6.2 to 2.0.0

2016-08-10 Thread Arun Luthra
s API change... what is the problem? Thanks, Arun

Re: Structured Streaming Parquet Sink

2016-07-30 Thread Arun Patel
:59 PM, Tathagata Das <t...@databricks.com> wrote: > Correction, the two options are. > > - writeStream.format("parquet").option("path", "...").start() > - writestream.parquet("...").start() > > There no start with param. > > On Jul 30, 201
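Putting that form together, a hedged sketch (streamingDF, the output path and the checkpoint directory are hypothetical; file sinks require a checkpoint location and support append output):

    val query = streamingDF.writeStream
      .format("parquet")
      .option("path", "/tmp/out")
      .option("checkpointLocation", "/tmp/checkpoint")
      .outputMode("append")
      .start()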

Structured Streaming Parquet Sink

2016-07-30 Thread Arun Patel
th or parquet in DataStreamWriter. scala> val query = streamingCountsDF.writeStream. foreach format option options outputMode partitionBy queryName start trigger Any idea how to write this to parquet file? - Arun

Re: Graphframe Error

2016-07-07 Thread Arun Patel
, Jul 5, 2016 at 5:37 AM -0700, "Arun Patel" < > arunp.bigd...@gmail.com> wrote: > > Thanks Yanbo and Felix. > > I tried these commands on CDH Quickstart VM and also on "Spark 1.6 > pre-built for Hadoop" version. I am still not able to get it working.

Re: Graphframe Error

2016-07-05 Thread Arun Patel
-0700, "Yanbo Liang" <yblia...@gmail.com> > wrote: > > Hi Arun, > > The command > > bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6 > > will automatically load the required graphframes jar file from maven > repository, it was not affected by the

Graphframe Error

2016-07-03 Thread Arun Patel
I am getting below error. >>> from graphframes.examples import Graphs Traceback (most recent call last): File "", line 1, in ImportError: Bad magic number in graphframes/examples.pyc Any help will be highly appreciated. - Arun

Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-14 Thread Arun Patel
Can anyone answer these questions please. On Mon, Jun 13, 2016 at 6:51 PM, Arun Patel <arunp.bigd...@gmail.com> wrote: > Thanks Michael. > > I went thru these slides already and could not find answers for these > specific questions. > > I created a Dataset and converte

Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-13 Thread Arun Patel
mit/structuring-spark-dataframes-datasets-and-streaming-by-michael-armbrust > > On Mon, Jun 13, 2016 at 4:01 AM, Arun Patel <arunp.bigd...@gmail.com> > wrote: > >> In Spark 2.0, DataFrames and Datasets are unified. DataFrame is simply an >> alias for a Dataset of t

Spark 2.0: Unify DataFrames and Datasets question

2016-06-13 Thread Arun Patel
? 4) Compile time safety will be there for DataFrames too? 5) Python API is supported for Datasets in 2.0? Thanks Arun

Re: Spark 2.0 Release Date

2016-06-07 Thread Arun Patel
Thanks Sean and Jacek. Do we have any updated documentation for 2.0 somewhere? On Tue, Jun 7, 2016 at 9:34 AM, Jacek Laskowski wrote: > On Tue, Jun 7, 2016 at 3:25 PM, Sean Owen wrote: > > That's not any kind of authoritative statement, just my opinion and

Re: Spark 2.0 Release Date

2016-06-07 Thread Arun Patel
Do we have any further updates on release date? Also, Is there a updated documentation for 2.0 somewhere? Thanks Arun On Thu, Apr 28, 2016 at 4:50 PM, Jacek Laskowski <ja...@japila.pl> wrote: > Hi Arun, > > My bet is...https://spark-summit.org/2016 :) > > Pozdrawi

Re: Hive_context

2016-05-23 Thread Arun Natva
Can you try a hive JDBC java client from eclipse and query a hive table successfully? This way we can narrow down where the issue is. Sent from my iPhone > On May 23, 2016, at 5:26 PM, Ajay Chander wrote: > > I downloaded the spark 1.5 utilities and exported

Re: HBase / Spark Kerberos problem

2016-05-19 Thread Arun Natva
Some of the Hadoop services cannot make use of the ticket obtained by loginUserFromKeytab. I was able to get past it using a GSS JAAS configuration where you can pass either a keytab file or a ticketCache to the spark executors that access HBase. Sent from my iPhone > On May 19, 2016, at 4:51 AM, Ellis,

Spark 2.0 Release Date

2016-04-28 Thread Arun Patel
A small request. Would you mind providing an approximate date of Spark 2.0 release? Is it early May or Mid May or End of May? Thanks, Arun

Re: transformation - spark vs cassandra

2016-03-31 Thread Arun Sethia
; using Cassandra (where cdate is part of primary key and country as cluster >> key). >> >> SELECT count(*) FROM test WHERE cdate ='2016-06-07' AND country='USA' >> >> I would like to know when should we use Cassandra simple query vs >> dataframe >> in

Re: DataFrame vs RDD

2016-03-22 Thread Arun Sethia
Thanks Vinay. Is it fair to say creating an RDD and creating a DataFrame from Cassandra uses Spark SQL, with the help of the Spark-Cassandra Connector API? On Tue, Mar 22, 2016 at 9:32 PM, Vinay Kashyap wrote: > DataFrame is when there is a schema associated with your RDD.. > For any of

Re: TaskCommitDenied (Driver denied task commit)

2016-01-22 Thread Arun Luthra
Correction. I have to use spark.yarn.am.memoryOverhead because I'm in Yarn client mode. I set it to 13% of the executor memory. Also quite helpful was increasing the total overall executor memory. It will be great when tungsten enhancements make there way into RDDs. Thanks! Arun On Thu, Jan

MemoryStore: Not enough space to cache broadcast_N in memory

2016-01-21 Thread Arun Luthra
rnal label. Then it would work the same as the sc.accumulator() "name" argument. It would enable more useful warn/error messages. Arun

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
e partitions? What is the > action you are performing? > > On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra <arun.lut...@gmail.com> > wrote: > >> Example warning: >> >> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID >> 4436, XXX): TaskCom

TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
won't have to increase it. The RDD being processed has 2262 partitions. Arun

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
count > will mask this exception because the coordination does not get triggered in > non save/write operations. > > On Thu, Jan 21, 2016 at 2:46 PM Holden Karau <hol...@pigscanfly.ca> wrote: > >> Before we dig too far into this, the thing which most quickly jumps out >> t

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
n Thu, Jan 21, 2016 at 2:56 PM, Arun Luthra <arun.lut...@gmail.com> > wrote: > >> Usually the pipeline works, it just failed on this particular input data. >> The other data it has run on is of similar size. >> >> Speculation is enabled. >> >> I'm usin

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
mind. On Thu, Jan 21, 2016 at 5:35 PM, Arun Luthra <arun.lut...@gmail.com> wrote: > Looking into the yarn logs for a similar job where an executor was > associated with the same error, I find: > > ... > 16/01/22 01:17:18 INFO client.TransportClientFactory: Found inactive &g

groupByKey does not work?

2016-01-04 Thread Arun Luthra
2 times. Is this the expected behavior? I need to be able to get ALL values associated with each key grouped into a SINGLE record. Is it possible? Arun p.s. reducebykey will not be sufficient for me

Re: groupByKey does not work?

2016-01-04 Thread Arun Luthra
see that each key is repeated 2 times but each key should only appear once. Arun On Mon, Jan 4, 2016 at 4:07 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Can you give a bit more information ? > > Release of Spark you're using > Minimal dataset that shows the problem >

Re: groupByKey does not work?

2016-01-04 Thread Arun Luthra
nt so we can count out any issues in object > equality. > > On Mon, Jan 4, 2016 at 4:42 PM Arun Luthra <arun.lut...@gmail.com> wrote: > >> Spark 1.5.0 >> >> data: >> >> p1,lo1,8,0,4,0,5,20150901|5,1,1.0 >> p1,lo2,8,0,4,0,5,20150901|

Re: Spark Streaming - Number of RDDs in Dstream

2015-12-21 Thread Arun Patel
Dec 21, 2015 at 11:04 AM, Arun Patel <arunp.bigd...@gmail.com> > wrote: > >> It may be simple question...But, I am struggling to understand this >> >> DStream is a sequence of RDDs created in a batch window. So, how do I >> know how many RDDs are created

Spark Streaming - Number of RDDs in Dstream

2015-12-20 Thread Arun Patel
/ spark.streaming.blockInterval) * number of receivers Is it like one RDD per receiver? or Multiple RDDs per receiver? What is the easiest way to find it? Arun

Content based window operation on Time-series data

2015-12-09 Thread Arun Verma
ject();stepResults.put("x", Long.parseLong(row.get(0).toString()));stepResults.put("y", row.get(1));appendResults.add(stepResults);}start = nextStart;nextStart = start + bucketLengthSec;}* -- Thanks and Regards, Arun Verma

Re: Content based window operation on Time-series data

2015-12-09 Thread Arun Verma
Thank you for your reply. It is a Scala and Python library. Does a similar library exist for Java? On Wed, Dec 9, 2015 at 10:26 PM, Sean Owen <so...@cloudera.com> wrote: > CC Sandy as his https://github.com/cloudera/spark-timeseries might be > of use here. > > On Wed, Dec 9, 201

types allowed for saveasobjectfile?

2015-08-27 Thread Arun Luthra
[Ljava.lang.String;@13144c [Ljava.lang.String;@75146d [Ljava.lang.String;@79118f Arun

Re: types allowed for saveasobjectfile?

2015-08-27 Thread Arun Luthra
Ah, yes, that did the trick. So more generally, can this handle any serializable object? On Thu, Aug 27, 2015 at 2:11 PM, Jonathan Coveney jcove...@gmail.com wrote: array[String] doesn't pretty print by default. Use .mkString(,) for example El jueves, 27 de agosto de 2015, Arun Luthra

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-24 Thread Arun Ahuja
for all the help everyone! But not sure worth still pursuing, not sure what else to try. Thanks, Arun On Tue, Jul 21, 2015 at 11:16 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: FWIW I've run into similar BLAS related problems before and wrote up a document on how to do

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-21 Thread Arun Ahuja
, would you know? Thanks, Arun On Tue, Jul 21, 2015 at 7:52 AM, Sean Owen so...@cloudera.com wrote: Great, and that file exists on HDFS and is world readable? just double-checking. What classpath is this -- your driver or executor? this is the driver, no? I assume so just because it looks like

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-20 Thread Arun Ahuja
Cool, I tried that as well, and doesn't seem different: spark.yarn.jar seems set [image: Inline image 1] This actually doesn't change the classpath, not sure if it should: [image: Inline image 3] But same netlib warning. Thanks for the help! - Arun On Fri, Jul 17, 2015 at 3:18 PM, Sandy

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-17 Thread Arun Ahuja
need to be adjusted in my application POM? Thanks, Arun On Thu, Jul 16, 2015 at 5:26 PM, Sean Owen so...@cloudera.com wrote: Yes, that's most of the work, just getting the native libs into the assembly. netlib can find them from there even if you don't have BLAS libs on your OS, since

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-17 Thread Arun Ahuja
-assembly-1.5.0-SNAPSHOT-hadoop2.6.0.jar | grep jniloader META-INF/maven/com.github.fommil/jniloader/ META-INF/maven/com.github.fommil/jniloader/pom.xml META-INF/maven/com.github.fommil/jniloader/pom.properties ​ Thanks, Arun On Fri, Jul 17, 2015 at 1:30 PM, Sean Owen so...@cloudera.com wrote: Make

Re: Is it possible to change the default port number 7077 for spark?

2015-07-13 Thread Arun Verma
PFA sample file On Mon, Jul 13, 2015 at 7:37 PM, Arun Verma arun.verma...@gmail.com wrote: Hi, Yes it is. To do it follow these steps; 1. cd spark/intallation/path/.../conf 2. cp spark-env.sh.template spark-env.sh 3. vi spark-env.sh 4. SPARK_MASTER_PORT=9000(or any other available port

Re: Is it possible to change the default port number 7077 for spark?

2015-07-13 Thread Arun Verma
at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Thanks and Regards, Arun Verma spark-env.sh Description: Bourne shell script

How to ignore features in mllib

2015-07-09 Thread Arun Luthra
-training-a-classifier Arun

How to change hive database?

2015-07-07 Thread Arun Luthra
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveContext I'm getting org.apache.spark.sql.catalyst.analysis.NoSuchTableException from: val dataframe = hiveContext.table(other_db.mytable) Do I have to change current database to access it? Is it possible to
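Two hedged sketches for reaching a table in another database with a Spark 1.x HiveContext (other_db.mytable as in the question):

    // fully qualify the table in SQL
    val df1 = hiveContext.sql("SELECT * FROM other_db.mytable")

    // or switch the current database first
    hiveContext.sql("USE other_db")
    val df2 = hiveContext.table("mytable")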

What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-07 Thread Arun Ahuja
implementation from: com.github.fommil.netlib.NativeRefLAPACK ​ Anything in this process I missed? Thanks, Arun

Re: unable to bring up cluster with ec2 script

2015-07-07 Thread Arun Ahuja
/ - Arun On Tue, Jul 7, 2015 at 4:34 PM, Pagliari, Roberto rpagli...@appcomsci.com wrote: I'm following the tutorial about Apache Spark on EC2. The output is the following: $ ./spark-ec2 -i ../spark.pem -k spark --copy launch spark-training Setting up security groups

Re: Spark launching without all of the requested YARN resources

2015-07-02 Thread Arun Luthra
Thanks Sandy et al, I will try that. I like that I can choose the minRegisteredResourcesRatio. On Wed, Jun 24, 2015 at 11:04 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Arun, You can achieve this by setting spark.scheduler.maxRegisteredResourcesWaitingTime to some really high number

Spark launching without all of the requested YARN resources

2015-06-23 Thread Arun Luthra
of resources that I request? Thanks, Arun

Missing values support in Mllib yet?

2015-06-19 Thread Arun Luthra
Hi, Is there any support for handling missing values in mllib yet, especially for decision trees where this is a natural feature? Arun

Re: Problem getting program to run on 15TB input

2015-06-09 Thread Arun Luthra
level usage of spark. @Arun, can you kindly confirm if Daniel’s suggestion helped your usecase? Thanks, Kapil Malik | kma...@adobe.com | 33430 / 8800836581 *From:* Daniel Mahler [mailto:dmah...@gmail.com] *Sent:* 13 April 2015 15:42 *To:* Arun Luthra *Cc:* Aaron Davidson; Paweł Szulc

Re: Efficient saveAsTextFile by key, directory for each key?

2015-04-22 Thread Arun Luthra
On Tue, Apr 21, 2015 at 5:45 PM, Arun Luthra arun.lut...@gmail.com wrote: Is there an efficient way to save an RDD with saveAsTextFile in such a way that the data gets shuffled into separated directories according to a key? (My end goal is to wrap the result in a multi-partitioned Hive table
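One possible sketch (not necessarily the approach settled on in this thread), assuming rdd is an RDD[(String, String)] of (key, value) pairs and a SparkSession named spark:

    import spark.implicits._

    rdd.toDF("key", "value")
      .write
      .partitionBy("key")      // creates .../key=<value>/part-... directories
      .parquet("/tmp/out")     // hypothetical output path; readable as a partitioned Hive table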

Scheduling across applications - Need suggestion

2015-04-22 Thread Arun Patel
applications. Is this correct? Regards, Arun

Efficient saveAsTextFile by key, directory for each key?

2015-04-21 Thread Arun Luthra
efficient solution exists... Thanks, Arun

mapPartitions vs foreachPartition

2015-04-20 Thread Arun Patel
What is difference between mapPartitions vs foreachPartition? When to use these? Thanks, Arun

Re: mapPartitions vs foreachPartition

2015-04-20 Thread Arun Patel
mapPartitions is a transformation and foreachPartition is an action? Thanks Arun On Mon, Apr 20, 2015 at 4:38 AM, Archit Thakur archit279tha...@gmail.com wrote: The same, which is between map and foreach. map takes an iterator and returns an iterator; foreach takes an iterator and returns Unit. On Mon, Apr
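A tiny sketch of the difference, assuming rdd is an RDD[String]:

    // mapPartitions is a transformation: returns a new RDD, nothing runs until an action
    val lengths = rdd.mapPartitions(iter => iter.map(_.length))

    // foreachPartition is an action: returns Unit, e.g. one connection per partition for writes
    rdd.foreachPartition { iter =>
      iter.foreach(println)
    }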

Code Deployment tools in Production

2015-04-19 Thread Arun Patel
Generally what tools are used to schedule spark jobs in production? How is spark streaming code deployed? I am interested in knowing the tools used, like cron, oozie etc. Thanks, Arun

Re: Dataframes Question

2015-04-19 Thread Arun Patel
for API stability as spark sql matured out of alpha as part of 1.3.0 release. It is forward looking and brings (dataframe like) syntax that was not available with the older schema RDD. On Apr 18, 2015, at 4:43 PM, Arun Patel arunp.bigd...@gmail.com wrote: Experts, I have few basic

Dataframes Question

2015-04-18 Thread Arun Patel
documentation, it looks like creating dataframe is no different than SchemaRDD - df = sqlContext.jsonFile(examples/src/main/resources/people.json). So, my question is what is the difference? Thanks for your help. Arun
