Re: Bintray replacement for spark-packages.org

2024-02-25 Thread Richard Eggert
I've been trying to obtain clarification on the terms of use regarding repo.spark-packages.org. I emailed feedb...@spark-packages.org two weeks ago, but have not heard back. Whom should I contact? On Mon, Apr 26, 2021 at 8:13 AM Bo Zhang wrote: > Hi Apache Spark users, > > As you might know,

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Richard Smith
-data-on-amazon-elastic-mapreduce/run-a-spark-job-within-amazon-emr-in-15-minutes-68b02af1ae16 EKS https://medium.com/@vikas.navlani/running-spark-on-aws-eks-1cd4c31786c Richard On 19/02/2024 13:36, Jagannath Majhi wrote: Dear Spark Community, I hope this email finds you well. I am reaching out

Parser error when running PySpark on Windows connecting to GCS

2023-11-04 Thread Richard Smith
icitly use forward slash if path contains gs: and the job now runs successfully. Richard

Re: Implementing TableProvider in Spark 3.0

2020-07-08 Thread Richard Xin
 Saw Sent from Yahoo Mail for iPhone On Wednesday, July 8, 2020, 9:26 PM, Sricheta Ruj wrote: Hello Spark Team   I am trying to use the DataSourceV2 API from Spark 3.0. I wanted to ask in case of write- how do I get the user specified schema?   This is what I am trying to

apache-spark Structured Stateful Streaming with window / SPARK-21641

2019-10-15 Thread Richard Reitmeyer
What’s the right way use Structured Streaming with both state and windows? Looking at the slides from https://www.slideshare.net/databricks/arbitrary-stateful-aggregations-using-structured-streaming-in-apache-spark slides 26 and 31, it looks like stateful processing events for every device

Spark SaveMode

2019-07-19 Thread Richard
h. Thanks, Richard

Re: Spark dataset to explode json string

2019-07-19 Thread Richard
> > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary da

Re: Spark dataset to explode json string

2019-07-19 Thread Richard
l's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Fri, 19 Jul 2019 at 23:17, Richard wrote: > >> Thanks for the reply, >> my situation is li

Re: Spark dataset to explode json string

2019-07-19 Thread Richard
at 2:26 PM Mich Talebzadeh wrote: > Hi Richard, > > You can use the following to read JSON data into DF. The example is > reading JSON from Kafka topic > > val sc = spark.sparkContext > import spark.implicits._ > // Use map to create the new RDD

Spark dataset to explode json string

2019-07-19 Thread Richard
let's say I use spark to migrate some data from Cassandra table to Oracle table Cassandra Table: CREATE TABLE SOURCE( id UUID PRIMARY KEY, col1 text, col2 text, jsonCol text ); example jsonCol value: {"foo": "val1", "bar": "val2"} I am trying to extract fields from the json column while importing
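A minimal Scala sketch of one way to do that extraction with from_json; the JSON field names come from the example above, while the Cassandra connector options (keyspace/table) and the Spark 2.x session API are assumptions that would need to match the real job:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("cassandra-to-oracle").getOrCreate()

// Schema of the JSON stored in jsonCol, e.g. {"foo": "val1", "bar": "val2"}
val jsonSchema = StructType(Seq(
  StructField("foo", StringType),
  StructField("bar", StringType)))

// 'source' stands in for the DataFrame read from the Cassandra SOURCE table
val source = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "mykeyspace", "table" -> "source"))
  .load()

// Parse the JSON column once, then flatten the extracted fields next to the other columns
val flattened = source
  .withColumn("json", from_json(col("jsonCol"), jsonSchema))
  .select(col("id"), col("col1"), col("col2"), col("json.foo").as("foo"), col("json.bar").as("bar"))

The flattened frame can then be written out through whatever JDBC/Oracle path the job already uses.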

Re: tcps oracle connection from spark

2019-06-18 Thread Richard Xin
and btw, same connection string works fine when used in SQL Developer.  On Tuesday, June 18, 2019, 03:49:24 PM PDT, Richard Xin wrote: HI, I need help with tcps oracle connection from spark (version:  spark-2.4.0-bin-hadoop2.7) Properties prop = new Properties();prop.putAll(sparkOracle

tcps oracle connection from spark

2019-06-18 Thread Richard Xin
HI, I need help with tcps oracle connection from spark (version:  spark-2.4.0-bin-hadoop2.7) Properties prop = new Properties();prop.putAll(sparkOracle);  // username/password prop.put("javax.net.ssl.trustStore", "path to root.jks");prop.put("javax.net.ssl.trustStorePassword", "password_here");
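For context, a hedged Scala sketch of what a TCPS read can look like; the connect descriptor, truststore path and credentials are placeholders and have to match the Oracle listener setup (the Java snippet above uses the same javax.net.ssl properties):

import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("oracle-tcps").getOrCreate()

// TCPS (SSL) connect descriptor -- host, port and service name are placeholders
val url = "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCPS)(HOST=db.example.com)(PORT=2484))" +
  "(CONNECT_DATA=(SERVICE_NAME=ORCL)))"

val props = new Properties()
props.put("user", "app_user")
props.put("password", "secret")
props.put("driver", "oracle.jdbc.OracleDriver")
// Standard JSSE truststore properties holding the server certificate
props.put("javax.net.ssl.trustStore", "/path/to/root.jks")
props.put("javax.net.ssl.trustStorePassword", "password_here")

val df = spark.read.jdbc(url, "MY_SCHEMA.MY_TABLE", props)
df.show(5)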

Re: spark-cassandra-connector_2.1 caused java.lang.NoClassDefFoundError under Spark 2.4.2?

2019-05-06 Thread Richard Xin
Sent from Yahoo Mail for iPhone On Monday, May 6, 2019, 18:34, Russell Spitzer wrote: Scala version mismatched Spark is shown at 2.12, the connector only has a 2.11 release  On Mon, May 6, 2019, 7:59 PM Richard Xin wrote: org.apache.spark spark-core_2.12 2.4.0 compile

spark-cassandra-connector_2.1 caused java.lang.NoClassDefFoundError under Spark 2.4.2?

2019-05-06 Thread Richard Xin
org.apache.spark spark-core_2.12 2.4.0 compile org.apache.spark spark-sql_2.12 2.4.0 com.datastax.spark spark-cassandra-connector_2.11 2.4.1 I run spark-submit I got following exceptions on Spark 2.4.2, it works fine when running  spark-submit under 

Re: [Spark SQL]: sql.DataFrame.replace to accept regexp

2019-03-01 Thread Richard Garris
/latest.html Richard L. Garris Director of Field Engineering Databricks, Inc. rich...@databricks.com Mobile: 650.200.0840 databricks.com <http://databricks.com/> On Fri, Mar 1, 2019 at 2:21 AM Nuno Silva wrote: > Hi, > > Not sure if I'm delivering my request through the right

Re: Use Spark extension points to implement row-level security

2018-08-18 Thread Richard Siebeling
from the existing documentation. Regards, Richard Op vr 17 aug. 2018 om 15:33 schreef Maximiliano Patricio Méndez < mmen...@despegar.com> > Hi, > > I've added table level security using spark extensions based on the > ongoing work proposed for ranger in RANGER-2128. Following the

Use Spark extension points to implement row-level security

2018-08-17 Thread Richard Siebeling
Hi, I'd like to implement some kind of row-level security and am thinking of adding additional filters to the logical plan possibly using the Spark extensions. Would this be feasible, for example using the injectResolutionRule? thanks in advance, Richard

Re: optimize hive query to move a subset of data from one partition table to another table

2018-02-11 Thread Richard Qiao
Would you mind share your code with us to analyze? > On Feb 10, 2018, at 10:18 AM, amit kumar singh wrote: > > Hi Team, > > We have hive external table which has 50 tb of data partitioned on year > month day > > i want to move last 2 month of data into another table >

Re: Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-11 Thread Richard Qiao
Can't find a good source for documents, but the source code “org.apache.spark.sql.execution.streaming.ProgressReporter” is helpful to answer some of them. For example: inputRowsPerSecond = numRecords / inputTimeSec, processedRowsPerSecond = numRecords / processingTimeSec This is explaining

Spark Dataframe Writer _temporary directory

2018-01-28 Thread Richard Primera
In a situation where multiple workflows write different partitions of the same table. Example: 10 Different processes are writing parquet or orc files for different partitions of the same table foo, at

Re: Run jobs in parallel in standalone mode

2018-01-16 Thread Richard Qiao
> Do you have any opinion for the solution. I really appreciate > > > > Onur EKİNCİ > Bilgi Yönetimi Yöneticisi > Knowledge Management Manager > > m:+90 553 044 2341 d:+90 212 329 7000 > > İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İs

Re: Run jobs in parallel in standalone mode

2018-01-16 Thread Richard Qiao
Curious you are using "jdbc:sqlserve" to connect to Oracle, why? Also a kind reminder to scrub your user id and password. Sent from my iPhone > On Jan 16, 2018, at 03:00, Onur EKİNCİ wrote: > > Hi, > > We are trying to get data from an Oracle database into Kinetica database

Partition Dataframe Using UDF On Partition Column

2017-12-27 Thread Richard Primera
Greetings, In version 1.6.0, is it possible to write a partitioned dataframe into parquet format using a UDF function on the partition column? I'm using pyspark. Let's say I have a dataframe with coumn `date`, of type string or int, which contains values such as `20170825`. Is it possible to
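The usual workaround, sketched here in Scala with the 2.x SparkSession API rather than the PySpark 1.6 API of the question, and with a made-up month-extraction rule: derive a new column with the UDF first, then partition on that derived column.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("partition-by-udf").getOrCreate()
import spark.implicits._

// Toy frame with a yyyyMMdd column such as 20170825
val df = Seq((1, "20170825"), (2, "20170826"), (3, "20170901")).toDF("id", "date")

// UDF that maps the raw value to the value we actually want to partition on
val toMonth = udf((d: String) => d.substring(0, 6))   // e.g. 201708

df.withColumn("month", toMonth(col("date")))
  .write
  .partitionBy("month")   // partitionBy takes a column name, so materialize the UDF result first
  .parquet("/tmp/foo_by_month")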

Anyone know where to find independent contractors in New York?

2017-12-21 Thread Richard L. Burton III
I'm trying to locate four independent contractors who have experience with Spark. I'm not sure where I can go to find experienced Spark consultants. Please, no recruiters. -- -Richard L. Burton III

Re: flatMap() returning large class

2017-12-18 Thread Richard Garris
-deep-learning/blob/f088de45daec06865ac02a9ec1323eb2c9eebb89/src/main/scala/com/databricks/sparkdl/ImageUtils.scala You can reuse this code potentially. Richard Garris Principal Architect Databricks, Inc 650.200.0840 rlgar...@databricks.com On December 17, 2017 at 3:12:41 PM, Don Drake (dondr

Re: flatMap() returning large class

2017-12-14 Thread Richard Garris
storing it as a vector or Array vs a large Java class object? That might be the more prudent approach. -RG Richard Garris Principal Architect Databricks, Inc 650.200.0840 rlgar...@databricks.com On December 14, 2017 at 10:23:00 AM, Marcelo Vanzin (van...@cloudera.com) wrote: This sounds like

Determine Cook's distance / influential data points

2017-12-13 Thread Richard Siebeling
Hi, would it be possible to determine the Cook's distance using Spark? thanks, Richard

Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-07 Thread Qiao, Richard
to RUNNING (even if 1 executor was allocated)? “ Best Regards Richard On 12/7/17, 2:40 PM, "bsikander" <behro...@gmail.com> wrote: Marcelo Vanzin wrote > I'm not sure I follow you here. This is something that you are > defining, not Spark. Yes, you are right.

Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-07 Thread Qiao, Richard
is submitted to executors”. With this concept, you may define your own status. Best Regards Richard On 12/4/17, 4:06 AM, "bsikander" <behro...@gmail.com> wrote: So, I tried to use SparkAppHandle.Listener with SparkLauncher as you suggested. The behavior of Launcher is not

Re: Do I need to do .collect inside forEachRDD

2017-12-07 Thread Qiao, Richard
Kant, right, we cannot use the Driver’s producer in an executor. That’s why I mentioned “kafka sink” to solve it. This article should be helpful about it: https://allegro.tech/2015/08/spark-kafka-integration.html Best Regards Richard From: kant kodali <kanth...@gmail.com> Date: Thursday, December 7
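A rough Scala sketch of that pattern: build the producer inside foreachPartition so it lives on the executors, never on the driver. Broker address, serializers and topic are placeholders; the linked article goes further and caches a lazily created producer ("sink") per executor instead of opening one per partition.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// `events` stands in for whatever DStream the application already has
def writeToKafka(events: DStream[String]): Unit = {
  events.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // Created inside the partition, so it is instantiated on the executor, not serialized from the driver
      val props = new Properties()
      props.put("bootstrap.servers", "broker1:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)

      records.foreach(msg => producer.send(new ProducerRecord[String, String]("topicA", msg)))
      producer.close()
    }
  }
}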

Re: Do I need to do .collect inside forEachRDD

2017-12-07 Thread Qiao, Richard
;>("topicA", gson.toJson(map))); // send smaller json in a task } } }); When you do it, make sure kafka producer (seek kafka sink for it) and gson’s environment setup correctly in executors. If after this, there is still OOM, let’s discuss further. Best Regar

Re: unable to connect to connect to cluster 2.2.0

2017-12-06 Thread Qiao, Richard
Are you now building your app using spark 2.2 or 2.1? Best Regards Richard From: Imran Rajjad <raj...@gmail.com> Date: Wednesday, December 6, 2017 at 2:45 AM To: "user @spark" <user@spark.apache.org> Subject: unable to connect to connect to cluster 2.2.0 Hi, Recent

Re: Do I need to do .collect inside forEachRDD

2017-12-05 Thread Qiao, Richard
In the 2nd case, is there any producer’s error thrown in executor’s log? Best Regards Richard From: kant kodali <kanth...@gmail.com> Date: Tuesday, December 5, 2017 at 4:38 PM To: "Qiao, Richard" <richard.q...@capitalone.com> Cc: "user @spark" <user@spark

Re: Do I need to do .collect inside forEachRDD

2017-12-05 Thread Qiao, Richard
Where do you check the output result for both case? Sent from my iPhone > On Dec 5, 2017, at 15:36, kant kodali wrote: > > Hi All, > > I have a simple stateless transformation using Dstreams (stuck with the old > API for one of the Application). The pseudo code is rough

Re: Access to Applications metrics

2017-12-04 Thread Qiao, Richard
It works to collect Job level, through Jolokia java agent. Best Regards Richard From: Nick Dimiduk <ndimi...@gmail.com> Date: Monday, December 4, 2017 at 6:53 PM To: "user@spark.apache.org" <user@spark.apache.org> Subject: Re: Access to Applications metrics Bump. On Wed

Re: Add snappy support for spark in Windows

2017-12-04 Thread Qiao, Richard
Junfeng, it's worth a try to start your local Spark with the hadoop.dll/winutils.exe etc. Hadoop Windows support package in HADOOP_HOME, if you didn’t do that yet. Best Regards Richard From: Junfeng Chen <darou...@gmail.com> Date: Monday, December 4, 2017 at 3:53 AM To: "Qiao, Richard&

Re: Add snappy support for spark in Windows

2017-12-04 Thread Qiao, Richard
It seems a common mistake that the path is not accessible by workers/executors. Best regards Richard Sent from my iPhone On Dec 3, 2017, at 22:32, Junfeng Chen <darou...@gmail.com<mailto:darou...@gmail.com>> wrote: I am working on importing snappy compressed json file in

Re: Dynamic Resource allocation in Spark Streaming

2017-12-03 Thread Qiao, Richard
Sourav: I’m using spark streaming 2.1.0 and can confirm spark.dynamicAllocation.enabled is enough. Best Regards Richard From: Sourav Mazumder <sourav.mazumde...@gmail.com> Date: Sunday, December 3, 2017 at 12:31 PM To: user <user@spark.apache.org> Subject: Dyna

Re: [Spark streaming] No assigned partition error during seek

2017-12-01 Thread Qiao, Richard
) Best Regards Richard From: venkat <meven...@gmail.com> Date: Thursday, November 30, 2017 at 8:16 PM To: Cody Koeninger <c...@koeninger.org> Cc: "user@spark.apache.org" <user@spark.apache.org> Subject: Re: [Spark streaming] No assigned partition error during seek I noti

Spark Streaming Kinesis Missing Records

2017-11-24 Thread Richard Moorhead
logs I see many ProvisionedThroughputExceededException however this should be benign in that the KCL should retry those records. Unfortunately I am not seeing the missing records processed at a later date. Where to look next? . . . . . . . . . . . . . . . . . . . . . . . . . . . Richard M

how sequence of chained jars in spark.(driver/executor).extraClassPath matters

2017-09-13 Thread Richard Xin
so let's say I have chained path in spark.driver.extraClassPath/spark.executor.extraClassPath such as /path1/*:/path2/*, and I have different versions of the same jar under those 2 directories, how spark pick the version of jar to use, from /path1/*? Thanks.

can I do spark-submit --jars [s3://bucket/folder/jar_file]? or --jars

2017-07-28 Thread Richard Xin
Can we add extra library (jars on S3) to spark-submit? if yes, how? such as --jars, extraClassPath, extraLibPathThanks,Richard

Re: Flatten JSON to multiple columns in Spark

2017-07-18 Thread Richard Xin
I believe you could use JOLT (bazaarvoice/jolt) to flatten it to a json string and then to dataframe or dataset. bazaarvoice/jolt - JSON to JSON transformation library written in Java. On Monday, July 17, 2017, 11:18:24 PM PDT, Chetan

Re: [Spark Core] Does spark support read from remote Hive server via JDBC

2017-06-08 Thread Richard Moorhead
ort() .getOrCreate() . . . . . . . . . . . . . . . . . . . . . . . . . . . Richard Moorhead Software Engineer richard.moorh...@c2fo.com<mailto:richard.moorh...@gmail.com> C2FO: The World's Market for Working Capital®

Re: Spark Streaming Job Stuck

2017-06-06 Thread Richard Moorhead
Set your master to local[10]; you are only allocating one core currently. . . . . . . . . . . . . . . . . . . . . . . . . . . . Richard Moorhead Software Engineer richard.moorh...@c2fo.com<mailto:richard.moorh...@gmail.com> C2FO: The World's Market for Working Capital®

Re: Jupyter spark Scala notebooks

2017-05-17 Thread Richard Moorhead
active and collaborative documents with SQL ... . . . . . . . . . . . . . . . . . . . . . . . . . . . Richard Moorhead Software Engineer richard.moorh...@c2fo.com<mailto:richard.moorh...@gmail.com> C2FO: The World's Market for Working Capital®

Spark Streaming: NullPointerException when restoring Spark Streaming job from hdfs/s3 checkpoint

2017-05-16 Thread Richard Moorhead
operations? logger.info(s"RDD LENGTH: ${events.count}") //nullpointer exception on call to .map val df = events.map(e => { ... } } } . . . . . . . . . . . . . . . . . . . . . . . . . . . Richard Moorhead Software Engineer richard.moorh...@c2fo.com<mailto:richard

Questions related to writing data to S3

2017-04-23 Thread Richard Hanson
I have a streaming job which writes data to S3. I know there are saveAs functions helping write data to S3. But it bundles all elements then writes out to S3. So my first question - Is there any way to let saveAs functions write data in batch or single elements instead of whole bundle?

Re: Handling skewed data

2017-04-19 Thread Richard Siebeling
I'm also interested in this, does anyone know this? On 17 April 2017 at 17:17, Vishnu Viswanath wrote: > Hello All, > > Does anyone know if the skew handling code mentioned in this talk > https://www.youtube.com/watch?v=bhYV0JOPd9Y was added to spark? > > If so can I

Spark-shell's performance

2017-04-17 Thread Richard Hanson
I am playing with some data using (stand alone) spark-shell (Spark version 1.6.0) by executing `spark-shell`. The flow is simple; a bit like cp - basically moving local 100k files (the max size is 190k) to S3. Memory is configured as below export SPARK_DRIVER_MEMORY=8192M export

[Spark Core]: flatMap/reduceByKey seems to be quite slow with Long keys on some distributions

2017-04-01 Thread Richard Tsai
specializations for Long keys which happen to perform not very well on some specific distributions. Does anyone have ideas about this? Best wishes, Richard // lines of word IDs val data = (1 to 5000).par.map({ _ => (1 to 1000) map { _ => (-1000 * Math.log(Random.nextDouble)).toInt } }).seq //

Re: apache-spark: Converting List of Rows into Dataset Java

2017-03-28 Thread Richard Xin
    JavaRDD jsonRDD =             new JavaSparkContext(sparkSession.sparkContext()).parallelize(results);                 Dataset peopleDF = sparkSession.createDataFrame(jsonRDD, Row.class); Richard Xin On Tuesday, March 28, 2017 7:51 AM, Karin Valisova <ka...@datapine.com> wrote:

Re: Fast write datastore...

2017-03-15 Thread Richard Siebeling
maybe Apache Ignite does fit your requirements On 15 March 2017 at 08:44, vincent gromakowski < vincent.gromakow...@gmail.com> wrote: > Hi > If queries are static and filters are on the same columns, Cassandra is a > good option. > > On 15 March 2017 at 7:04 AM, "muthu" wrote

Re: Continuous or Categorical

2017-03-01 Thread Richard Siebeling
I think it's difficult to determine with certainty if a variable is continuous or categorical, for example when the values are numbers like 1, 2, 2, 3, 4, 5. These values could be either continuous or categorical. However you could perform some checks: - are there any decimal values > it will
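Those checks can be sketched directly against a DataFrame column; the fractional-value test and the distinct-count cutoff below are illustrative assumptions, not a fixed rule:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("continuous-vs-categorical").getOrCreate()
import spark.implicits._

val df = Seq(1.0, 2.0, 2.0, 3.0, 4.0, 5.0).toDF("x")

val checks = df.agg(
  count("x").as("n"),
  countDistinct("x").as("distinct"),
  // any value with a fractional part is a strong hint the column is continuous
  sum(when(col("x") =!= floor(col("x")), 1).otherwise(0)).as("non_integer_values"))

val row = checks.head()
val looksCategorical = row.getAs[Long]("non_integer_values") == 0 &&
  row.getAs[Long]("distinct") <= 20   // arbitrary cutoff

println(s"categorical guess: $looksCategorical")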

Re: Issue creating row with java.util.Map type

2017-01-27 Thread Richard Xin
try Row newRow = RowFactory.create(row.getString(0), row.getString(1), row.getMap(2)); On Friday, January 27, 2017 10:52 AM, Ankur Srivastava wrote: + DEV Mailing List On Thu, Jan 26, 2017 at 5:12 PM, Ankur Srivastava wrote:

Re: is it possible to read .mdb file in spark

2017-01-26 Thread Richard Siebeling
Hi, haven't used it, but Jackcess should do the trick > http://jackcess.sourceforge.net/ kind regards, Richard 2017-01-25 11:47 GMT+01:00 Selvam Raman <sel...@gmail.com>: > > > -- > Selvam Raman > "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து" >

is partitionBy of DataFrameWriter supported in 1.6.x?

2017-01-18 Thread Richard Xin
I found contradictions in document 1.6.0 and 2.1.x in http://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.DataFrameWriterit says: "This is only applicable for Parquet at the moment." in

Re: Kryo On Spark 1.6.0

2017-01-10 Thread Richard Startin
here - http://spark.apache.org/docs/latest/tuning.html <http://spark.apache.org/docs/latest/tuning.html>Cheers, Richard https://richardstartin.com/ From: Enrico DUrso <enrico.du...@everis.com> Sent: 10 January 2017 11:10 To: user@spark.apache.org Subj

Re: Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-09 Thread Richard Siebeling
changes of behaviour or changes in the build process or something like that, kind regards, Richard On 9 January 2017 at 22:55, Richard Siebeling <rsiebel...@gmail.com> wrote: > Hi, > > I'm setting up Apache Spark 2.1.0 on Mesos and I am getting a "Could not > p

Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-09 Thread Richard Siebeling
e the same configuration but using a Spark 2.0.0 is running fine within Vagrant. Could someone please help? thanks in advance, Richard

Re: ToLocalIterator vs collect

2017-01-05 Thread Richard Startin
Why not do that with spark sql to utilise the executors properly, rather than a sequential filter on the driver. Select * from A left join B on A.fk = B.fk where B.pk is NULL limit k If you were sorting just so you could iterate in order, this might save you a couple of sorts too.
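The same "rows of A with no match in B" idea expressed with the DataFrame API, as a small sketch (column names invented; left_anti is available in Spark 2.x):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("anti-join-sketch").getOrCreate()
import spark.implicits._

val a = Seq((1, "x"), (2, "y"), (3, "z")).toDF("fk", "payload")
val b = Seq((2, "match")).toDF("fk", "other")

// Rows of A whose key never appears in B, computed on the executors rather than on the driver
val unmatched = a.join(b, Seq("fk"), "left_anti")
unmatched.limit(10).show()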

Re: DataFrame to read json and include raw Json in DataFrame

2016-12-29 Thread Richard Xin
thanks, I have seen this, but this doesn't cover my question. What I need is read json and include raw json as part of my dataframe. On Friday, December 30, 2016 10:23 AM, Annabel Melongo <melongo_anna...@yahoo.com.INVALID> wrote: Richard, Below documentation will show you how to

DataFrame to read json and include raw Json in DataFrame

2016-12-29 Thread Richard Xin
nf = new SparkConf().setMaster("local[2]").setAppName("json_test");         JavaSparkContext ctx = new JavaSparkContext(sparkConf);     HiveContext hc = new HiveContext(ctx.sc());     DataFrame df = hc.read().json("files/json/example2.json"); what I need is a DataFrame with columns id, ln, fn, age as well as raw_json string any advice on the best practice in java?Thanks, Richard
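One way to keep both the parsed columns and the raw string, sketched with the Spark 2.x from_json API rather than the HiveContext of the question, and with an assumed schema for the example file:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("json-with-raw").getOrCreate()

// Assumed schema for files/json/example2.json
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("fn", StringType),
  StructField("ln", StringType),
  StructField("age", IntegerType)))

// Read each line as plain text so the raw JSON survives as a column...
val raw = spark.read.text("files/json/example2.json").withColumnRenamed("value", "raw_json")

// ...then parse it and put the extracted fields next to it
val df = raw
  .withColumn("parsed", from_json(col("raw_json"), schema))
  .select(col("parsed.id"), col("parsed.ln"), col("parsed.fn"), col("parsed.age"), col("raw_json"))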

Re: access Broadcast Variables in Spark java

2016-12-20 Thread Richard Xin
try this:JavaRDD mapr = listrdd.map(x -> broadcastVar.value().get(x)); On Wednesday, December 21, 2016 2:25 PM, Sateesh Karuturi wrote: I need to process spark Broadcast variables using Java RDD API. This is my code what i have tried so far:This is only

Re: withColumn gives "Can only zip RDDs with same number of elements in each partition" but not with a LIMIT on the dataframe

2016-12-20 Thread Richard Startin
I think limit repartitions your data into a single partition if called as a non terminal operator. Hence zip works after limit because you only have one partition. In practice, I have found joins to be much more applicable than zip because of the strict limitation of identical partitions.

Re: How to get recent value in spark dataframe

2016-12-18 Thread Richard Xin
I am not sure I understood your logic, but it seems to me that you could take a look of Hive's Lead/Lag functions. On Monday, December 19, 2016 1:41 AM, Milin korath wrote: thanks, I tried with left outer join. My dataset having around 400M records and lot
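A small Scala sketch of the Lead/Lag suggestion using Spark's own window functions (table and column names are invented for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag}

val spark = SparkSession.builder().appName("lag-example").getOrCreate()
import spark.implicits._

val readings = Seq(
  ("dev1", 1L, 10.0), ("dev1", 2L, 12.5), ("dev1", 3L, 11.0),
  ("dev2", 1L, 7.0),  ("dev2", 2L, 8.5)
).toDF("device", "ts", "value")

// For each device, pull the most recent previous value alongside the current row
val w = Window.partitionBy("device").orderBy("ts")
readings.withColumn("prev_value", lag(col("value"), 1).over(w)).show()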

Re: Java to show struct field from a Dataframe

2016-12-17 Thread Richard Xin
rn row;             } From: Richard Xin <richardxin...@yahoo.com> Sent: Saturday, December 17, 2016 8:53 PM To: Yong Zhang; zjp_j...@163.com; user Subject: Re: Java to show struct field from a Dataframe I tried to transform root  |-- latitude: double (nullable = false)  |-- longitude: double (null

Re: Java to show struct field from a Dataframe

2016-12-17 Thread Richard Xin
;).schema().printTreeString();  // prints schema tree OK as expected transformedDf.show();  // java.lang.ClassCastException: [D cannot be cast to java.lang.Double seems to me that the ReturnType of the UDF2 might be the root cause. but not sure how to correct. Thanks,Richard On

Re: Java to show struct field from a Dataframe

2016-12-17 Thread Richard Xin
I think the cause is your invalid Double data, have you checked your data? zjp_j...@163.com From: Richard Xin Date: 2016-12-17 23:28 To: User Subject: Java to show struct field from a Dataframe let's say I

Java to show struct field from a Dataframe

2016-12-17 Thread Richard Xin
let's say I have a DataFrame with schema of followings:root  |-- name: string (nullable = true)  |-- location: struct (nullable = true)  |    |-- longitude: double (nullable = true)  |    |-- latitude: double (nullable = true) df.show(); throws following exception: java.lang.ClassCastException:

Re: need help to have a Java version of this scala script

2016-12-17 Thread Richard Xin
iate function import static org.apache.spark.sql.functions.callUDF; import static org.apache.spark.sql.functions.col; udf should be callUDF e.g. ds.withColumn("localMonth", callUDF("toLocalMonth", col("unixTs"), col("tz"))) On 17 December 2016 at 09:54, Richa

need help to have a Java version of this scala script

2016-12-16 Thread Richard Xin
what I am trying to do:I need to add column (could be complicated transformation based on value of a column) to a give dataframe. scala script:val hContext = new HiveContext(sc) import hContext.implicits._ val df = hContext.sql("select x,y,cluster_no from test.dc") val len = udf((str: String) =>

Re: Spark streaming completed batches statistics

2016-12-07 Thread Richard Startin
Ok it looks like I could reconstruct the logic in the Spark UI from the /jobs resource. Thanks. https://richardstartin.com/ From: map reduced <k3t.gi...@gmail.com> Sent: 07 December 2016 19:49 To: Richard Startin Cc: user@spark.apache.org Subject: Re:
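A minimal sketch of pulling that data programmatically instead of screen-scraping, via the monitoring REST API; the host, port and application id are placeholders, and batch-level streaming statistics may need the streaming endpoints of later Spark versions:

import scala.io.Source

// Placeholders: the driver UI host/port of a running application and its id
val appId = "app-20161207120000-0001"
val url = s"http://driver-host:4040/api/v1/applications/$appId/jobs"

// Returns a JSON array of job descriptions (status, submission time, task counts, ...)
val json = Source.fromURL(url).mkString
println(json)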

Re: Spark streaming completed batches statistics

2016-12-07 Thread Richard Startin
Is there any way to get this information as CSV/JSON? https://docs.databricks.com/_images/CompletedBatches.png https://richardstartin.com/ From: Richard Startin <richardstar...@outlook.com> Se

Re: Back-pressure to Spark Kafka Streaming?

2016-12-05 Thread Richard Startin
help react quickly to increased/reduced capacity. spark.streaming.backpressure.pid.minRate - the default value is 100 (must be positive), batch size won't go below this. spark.streaming.receiver.maxRate - batch size won't go above this. Cheers, Richard https://richards
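For reference, a hedged example of wiring those settings into a streaming job; the values are purely illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-backpressure")
  // Let Spark adapt the ingestion rate to the observed processing speed
  .set("spark.streaming.backpressure.enabled", "true")
  // Floor for the PID-controlled rate (records/sec); the default is 100
  .set("spark.streaming.backpressure.pid.minRate", "100")
  // Hard ceiling per receiver (records/sec)
  .set("spark.streaming.receiver.maxRate", "10000")
  // For the direct Kafka stream the per-partition ceiling applies instead
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

val ssc = new StreamingContext(conf, Seconds(10))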

Spark streaming completed batches statistics

2016-12-05 Thread Richard Startin
Is there any way to get a more computer-friendly version of the completed batches section of the streaming page of the application master? I am very interested in the statistics and am currently screen-scraping... https://richardstartin.com

Re: Livy with Spark

2016-12-05 Thread Richard Startin
There is a great write up on Livy at http://henning.kropponline.de/2016/11/06/ On 5 Dec 2016, at 14:34, Mich Talebzadeh > wrote: Hi, Has there been any experience using Livy with Spark to share multiple Spark contexts? thanks Dr

Re: LDA and Maximum Iterations

2016-10-19 Thread Richard Garris
Hi Frank, Two suggestions 1. I would recommend caching the corpus prior to running LDA 2. If you are using EM I would tweak the sample size using the setMiniBatchFraction parameter to decrease the sample per iteration. -Richard On Tue, Sep 20, 2016 at 10:27 AM, Frank Zhang < datami
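A hedged sketch of both suggestions with the RDD-based MLlib API; note that setMiniBatchFraction belongs to the online optimizer, and the k/iteration/fraction values are made up:

import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// `corpus` stands in for the prepared (docId, termCountVector) RDD
def train(corpus: RDD[(Long, Vector)]) = {
  corpus.cache()   // suggestion 1: cache the corpus before iterating

  val lda = new LDA()
    .setK(20)
    .setMaxIterations(100)
    .setOptimizer(new OnlineLDAOptimizer()
      .setMiniBatchFraction(0.05))   // suggestion 2: sample a fraction of documents per iteration

  lda.run(corpus)
}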

Re: off heap to alluxio/tachyon in Spark 2

2016-09-19 Thread Richard Catlin
it with Memory, SSD, and/or HDDs with the DFS as the persistent store, called under-filesystem. Hope this helps. Richard Catlin > On Sep 19, 2016, at 7:56 AM, aka.fe2s <aka.f...@gmail.com> wrote: > > Hi folks, > > What has happened with Tachyon / Alluxio in Spark 2? Doc doesn't me

Fwd: Missing output partition file in S3

2016-09-19 Thread Richard Catlin
> Begin forwarded message: > > From: "Chen, Kevin" > Subject: Re: Missing output partition file in S3 > Date: September 19, 2016 at 10:54:44 AM PDT > To: Steve Loughran > Cc: "user@spark.apache.org" > > Hi Steve, > >

Re: Best way to calculate intermediate column statistics

2016-08-25 Thread Richard Siebeling
). The analytic functions could help when gathering the statistics over the whole set, kind regards, Richard On Wed, Aug 24, 2016 at 10:54 PM, Mich Talebzadeh <mich.talebza...@gmail.com > wrote: > Hi Richard, > > can you use analytics functions for this purpose on DF > > HTH &

Re: Best way to calculate intermediate column statistics

2016-08-24 Thread Richard Siebeling
regards, Richard On Wed, Aug 24, 2016 at 6:52 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > Hi Richard, > > What is the business use case for such statistics? > > HTH > > Dr Mich Talebzadeh > > > > LinkedIn * >

Best way to calculate intermediate column statistics

2016-08-24 Thread Richard Siebeling
, is that possible? We could sacrifice a little bit of performance (but not too much), that's why we prefer one pass... Is this possible in the standard Spark or would this mean modifying the source a little bit and recompiling? Is that feasible / wise to do? thanks in advance, Richard
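As a sketch of the "one pass" idea: a single agg() projection computes several column statistics in one job. Function availability differs slightly between 1.6 and 2.x (approx_count_distinct is the 2.1+ name), so treat this as illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("column-stats").getOrCreate()
import spark.implicits._

val df = Seq((1, 10.0, "a"), (2, 12.5, "b"), (3, 11.0, "a")).toDF("id", "amount", "category")

// All of these aggregates are evaluated in a single pass over the data
val stats = df.agg(
  count(lit(1)).as("rows"),
  min("amount").as("min_amount"),
  max("amount").as("max_amount"),
  avg("amount").as("avg_amount"),
  stddev("amount").as("stddev_amount"),
  approx_count_distinct("category").as("distinct_categories"))

stats.show()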

Re: HiveThriftServer and spark.sql.hive.thriftServer.singleSession setting

2016-08-19 Thread Richard M
I was using the 1.1 driver. I upgraded that library to 2.1 and it resolved my problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/HiveThriftServer-and-spark-sql-hive-thriftServer-singleSession-setting-tp27340p27566.html Sent from the Apache Spark User

Spark 1.6.2 HiveServer2 cannot access temp tables

2016-08-11 Thread Richard M
I'm attempting to access a dataframe from jdbc: However this temp table is not accessible from beeline when connected to this instance of HiveServer2. -- View this message in context:

Re: Table registered using registerTempTable not found in HiveContext

2016-08-11 Thread Richard M
How are you calling registerTempTable from hiveContext? It appears to be a private method. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Table-registered-using-registerTempTable-not-found-in-HiveContext-tp26555p27514.html Sent from the Apache Spark User

Re: HiveThriftServer and spark.sql.hive.thriftServer.singleSession setting

2016-08-11 Thread Richard M
I am running HiveServer2 as well and when I connect with beeline I get the following: org.apache.spark.sql.internal.SessionState cannot be cast to org.apache.spark.sql.hive.HiveSessionState Do you know how to resolve this? -- View this message in context:

Re: Spark 2.0 - make-distribution fails while regular build succeeded

2016-08-04 Thread Richard Siebeling
Fixed! After adding the option -DskipTests everything built ok. Thanks Sean for your help On Thu, Aug 4, 2016 at 8:18 PM, Richard Siebeling <rsiebel...@gmail.com> wrote: > I don't see any other errors, these are the last lines of the > make-distribution log. > Ab

Re: Spark 2.0 - make-distribution fails while regular build succeeded

2016-08-04 Thread Richard Siebeling
016 at 6:30 PM, Sean Owen <so...@cloudera.com> wrote: > That message is a warning, not error. It is just because you're cross > compiling with Java 8. If something failed it was elsewhere. > > > On Thu, Aug 4, 2016, 07:09 Richard Siebeling <rsiebel...@gmail.com> w

Spark 2.0 - make-distribution fails while regular build succeeded

2016-08-04 Thread Richard Siebeling
--tgz -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0-mapr-1602 It fails with the error "bootstrap class path not set in conjunction with -source 1.7" Could you please help? I do not know what this error means, thanks in advance, Richard

Option Encoder

2016-06-23 Thread Richard Marscher
Is there a proper way to make or get an Encoder for Option in Spark 2.0? There isn't one by default and while ExpressionEncoder from catalyst will work, it is private and unsupported. -- *Richard Marscher* Senior Software Engineer Localytics Localytics.com <http://localytics.com/> | Ou

Re: difference between dataframe and dataframwrite

2016-06-16 Thread Richard Catlin
I believe it depends on your Spark application. To write to Hive, use dataframe.saveAsTable To write to S3, use dataframe.write.parquet(“s3://”) Hope this helps. Richard > On Jun 16, 2016, at 9:54 AM, Natu Lauchande <nlaucha...@gmail.com> wrote: > > Does

Re: Dataset - reduceByKey

2016-06-07 Thread Richard Marscher
e - I do not see a >> simple reduceByKey replacement. >> >> Regards, >> >> Bryan Jeffrey >> >> > -- *Richard Marscher* Senior Software Engineer Localytics Localytics.com <http://localytics.com/> | Our Blog <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> | Facebook <http://facebook.com/localytics> | LinkedIn <http://www.linkedin.com/company/1148792?trk=tyah>
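For completeness, the usual Dataset-side stand-in for reduceByKey, as a hedged Spark 2.x sketch:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ds-reduce-by-key").getOrCreate()
import spark.implicits._

val ds = Seq(("a", 1), ("b", 2), ("a", 3)).toDS()

// groupByKey + reduceGroups plays the role of RDD.reduceByKey on a Dataset
val reduced = ds
  .groupByKey { case (k, _) => k }
  .reduceGroups((x, y) => (x._1, x._2 + y._2))
  .map { case (_, (k, total)) => (k, total) }

reduced.show()   // ("a", 4) and ("b", 2)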

Re: Dataset Outer Join vs RDD Outer Join

2016-06-07 Thread Richard Marscher
wrote: > That kind of stuff is likely fixed in 2.0. If you can get a reproduction > working there it would be very helpful if you could open a JIRA. > > On Mon, Jun 6, 2016 at 7:37 AM, Richard Marscher <rmarsc...@localytics.com > > wrote: > >> A quick unit test

Re: Dataset Outer Join vs RDD Outer Join

2016-06-06 Thread Richard Marscher
ault. > > That said, I would like to enable that kind of sugar while still taking > advantage of all the optimizations going on under the covers. Can you get > it to work if you use `as[...]` instead of `map`? > > On Wed, Jun 1, 2016 at 11:59 AM, Richard Marscher &l

Re: Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Richard Marscher
1, 2016 at 1:42 PM, Michael Armbrust <mich...@databricks.com> wrote: > Thanks for the feedback. I think this will address at least some of the > problems you are describing: https://github.com/apache/spark/pull/13425 > > On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <rma

Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Richard Marscher
to -1 instead of null. Now it's completely ambiguous what data in the join was actually there versus populated via this atypical semantic. Are there additional options available to work around this issue? I can convert to RDD and back to Dataset but that's less than ideal. Thanks, -- *Richard

Errors when running SparkPi on a clean Spark 1.6.1 on Mesos

2016-05-15 Thread Richard Siebeling
Well the task itself completed (it indeed gives a result) but the tasks in Mesos say killed and it gives an error: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Kind regards, Richard On Monday 16 May 2016, Jacek Laskowski <

Re: Errors when running SparkPi on a clean Spark 1.6.1 on Mesos

2016-05-15 Thread Richard Siebeling
B.t.w. this is on a single-node cluster. On Sunday 15 May 2016, Richard Siebeling <rsiebel...@gmail.com> wrote: > Hi, > > I'm getting the following errors running SparkPi on a clean just compiled > and checked Mesos 0.29.0 installation with Spark 1.6.1 >

Errors when running SparkPi on a clean Spark 1.6.1 on Mesos

2016-05-15 Thread Richard Siebeling
. Please help, thanks in advance, Richard The complete logs are sudo ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master mesos://192.168.33.10:5050 --deploy-mode client ./lib/spark-examples* 10 16/05/15 23:05:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your
