Anomaly when dealing with Unix timestamp

2018-06-19 Thread Raymond Xie
Hello, I have a dataframe; applying from_unixtime seems to expose an anomaly: scala> val bhDF4 = bhDF.withColumn("ts1", $"ts" + 28800).withColumn("ts2", from_unixtime($"ts" + 28800, "MMddhhmmss")) bhDF4: org.apache.spark.sql.DataFrame = [user_id: int, item_id: int ... 5 more fields] scala>
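
A hedged sketch of the two likely culprits (assuming Spark 2.2+ in spark-shell; values are illustrative): in the pattern string, "hh" is the 12-hour clock, so afternoon timestamps look anomalous without "HH", and the hand-added 28800 seconds (8 hours) duplicates what a session time zone setting already does.

  import spark.implicits._
  import org.apache.spark.sql.functions.from_unixtime

  // "ts" holds Unix epoch seconds; 28800 s = 8 h, i.e. a manual UTC+8 shift.
  val df = Seq(1511544070L, 1512259200L).toDF("ts")

  // Let Spark apply the zone instead of adding 28800 by hand (Spark 2.2+)...
  spark.conf.set("spark.sql.session.timeZone", "Asia/Shanghai")

  // ...and use "HH" (24-hour) plus an explicit "yyyy" for an unambiguous pattern.
  df.withColumn("ts2", from_unixtime($"ts", "yyyyMMddHHmmss")).show(false)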

Re: Best way to process this dataset

2018-06-19 Thread Raymond Xie
parser is based on univocity and you might use the > "spark.read.csv" syntax instead of using the rdd api; > > From my experience, this will work better than any other csv parser > > 2018-06-19 16:43 GMT+02:00 Raymond Xie : > >> Thank you Matteo, Askash and Georg:
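
A minimal sketch of the suggested DataFrame reader (Spark 2.x; the path and options are illustrative):

  // spark.read.csv is backed by the univocity parser mentioned above.
  val df = spark.read
    .option("header", "false")
    .option("inferSchema", "true")
    .csv("/data/UserBehavior.csv")
  df.printSchema()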

Re: Best way to process this dataset

2018-06-19 Thread Raymond Xie
> wrote: >> >>> use pandas or dask >>> >>> If you do want to use spark store the dataset as parquet / orc. And then >>> continue to perform analytical queries on that dataset. >>> >>> Raymond Xie wrote on Tue., June 19, 2018 at
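
A hedged sketch of the parquet suggestion (assuming the CSV is already loaded as df, as in the sketch above; paths are illustrative):

  // One-time conversion; later analytical queries read the compressed,
  // columnar copy instead of re-parsing the 3.6 GB CSV.
  df.write.mode("overwrite").parquet("/data/behavior.parquet")

  val behavior = spark.read.parquet("/data/behavior.parquet")
  behavior.createOrReplaceTempView("behavior")
  spark.sql("SELECT COUNT(*) FROM behavior").show()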

Best way to process this dataset

2018-06-18 Thread Raymond Xie
I have a 3.6GB csv dataset (4 columns, 100,150,807 rows); my environment has a 20GB SSD hard disk and 2GB of RAM. The dataset comes with User ID: 987,994 Item ID: 4,162,024 Category ID: 9,439 Behavior type ('pv', 'buy', 'cart', 'fav') Unix Timestamp: spanning November 25 to December 03, 2017 I
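
With only 2 GB of RAM, supplying an explicit schema avoids a second pass over the 100M rows for type inference, and aggregations stream through without collecting to the driver. A sketch under those assumptions (field names taken from the post's field list; path illustrative):

  import org.apache.spark.sql.types._

  val schema = StructType(Seq(
    StructField("user_id",     IntegerType),
    StructField("item_id",     IntegerType),
    StructField("category_id", IntegerType),
    StructField("behavior",    StringType),
    StructField("ts",          LongType)))

  val events = spark.read.schema(schema).csv("/data/UserBehavior.csv")
  events.groupBy("behavior").count().show()   // pv / buy / cart / fav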

Re: how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed

2018-06-17 Thread Raymond Xie
> On Jun 17, 2018, at 2:32 PM, Raymond Xie wrote: > > Hello, > > I am wondering how I can run a Spark job in my environment, which is a single > Ubuntu host with no Hadoop installed? If I run my job like below, I will > end up with an infinite loop at the end. Thank you very

how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed

2018-06-17 Thread Raymond Xie
Hello, I am wondering how I can run a Spark job in my environment, which is a single Ubuntu host with no Hadoop installed. If I run my job like below, I will end up with an infinite loop at the end. Thank you very much. rxie@ubuntu:~/data$ spark-submit --class retail_db.GetRevenuePerOrder --conf
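
Spark runs fine without a Hadoop cluster when told to use the local master; a hedged sketch (class and jar names as they appear elsewhere in these threads; path illustrative):

  spark-submit \
    --class retail_db.GetRevenuePerOrder \
    --master local[*] \
    target/scala-2.11/spark2practice_2.11-0.1.jar

With --master local[*], Spark uses in-process executors and the local filesystem, so no HDFS or YARN is needed.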

Re: Error: Could not find or load main class org.apache.spark.launcher.Main

2018-06-17 Thread Raymond Xie
> Best Regards, > Vamshi T > > ---------- > From: Raymond Xie > Sent: Sunday, June 17, 2018 6:27 AM > To: user; Hui Xie > Subject: Error: Could not find or load main class org.apache.spark.launcher.Main > > Hello, > > I

Error: Could not find or load main class org.apache.spark.launcher.Main

2018-06-17 Thread Raymond Xie
Hello, it would be really appreciated if anyone could help me sort out the following path issue. I suspect this is related to a missing path setting but don't know how to fix it. rxie@ubuntu:~/Downloads/spark$ echo $PATH
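
"Could not find or load main class org.apache.spark.launcher.Main" usually means SPARK_HOME does not point at the directory that actually contains bin/ and jars/. A hedged sketch (paths assume the tarball was unpacked under ~/Downloads/spark, as in the prompt above):

  export SPARK_HOME=~/Downloads/spark
  export PATH=$SPARK_HOME/bin:$PATH
  ls $SPARK_HOME/jars | grep launcher   # the class lives in spark-launcher_*.jar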

spark-shell doesn't start

2018-06-17 Thread Raymond Xie
Hello, I am now practicing on Ubuntu; here is the error I am encountering: rxie@ubuntu:~/Downloads/spark/bin$ spark-shell Error: Could not find or load main class org.apache.spark.launcher.Main What am I missing? Thank you very much. Java is installed.

spark-submit Error: Cannot load main class from JAR file

2018-06-17 Thread Raymond Xie
Hello, I am now practicing on Windows. I have the jar file generated under: C:\RXIE\Learning\Scala\spark2practice\target\scala-2.11\spark2practice_2.11-0.1.jar The package name is Retail_db and the object is GetRevenuePerOrder. The spark-submit command is: spark-submit
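
"Cannot load main class from JAR file" is typically an argument-ordering problem: --class (with the fully qualified, case-sensitive object name) must come before the application jar, since everything after the jar is passed to the program itself. A hedged sketch for cmd.exe, using the names from these threads:

  spark-submit ^
    --class retail_db.GetRevenuePerOrder ^
    --master local[*] ^
    C:\RXIE\Learning\Scala\spark2practice\target\scala-2.11\spark2practice_2.11-0.1.jar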

Re: Not able to sort out environment settings to start spark from windows

2018-06-16 Thread Raymond Xie
Maybe a special char or space in your path. > > Regards, > Vaquar khan > > On Sat, Jun 16, 2018, 1:36 PM Raymond Xie wrote: > >> I am trying to run spark-shell in Windows but receive an error of: >> >> \Java\jre1.8.0_151\bin\java was unexpected at

Not able to sort out environment settings to start spark from windows

2018-06-16 Thread Raymond Xie
I am trying to run spark-shell on Windows but receive the error: \Java\jre1.8.0_151\bin\java was unexpected at this time. Environment: System variables: SPARK_HOME: c:\spark Path: C:\Program Files (x86)\Common
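
"... was unexpected at this time" is cmd.exe choking on the parentheses in "Program Files (x86)" when the launcher scripts expand JAVA_HOME. A hedged sketch of the usual workaround, the 8.3 short name (PROGRA~2 is common but not guaranteed; verify with dir /x C:\):

  set JAVA_HOME=C:\PROGRA~2\Java\jre1.8.0_151
  set PATH=%JAVA_HOME%\bin;%SPARK_HOME%\bin;%PATH%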

Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Raymond Xie
<mmistr...@gmail.com> wrote: > Try to use --packages to include the jars. From the error it seems it's > looking for the main class in the jars, but you are running a python script... > > On 25 Feb 2017 10:36 pm, "Raymond Xie" <xie3208...@gmail.com> wrote: > > That'

Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Raymond Xie
anahita.t.am...@gmail.com> wrote: >> >>> Hi, >>> >>> I think if you remove --jars, it will work. Like: >>> >>> spark-submit /usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar

Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Raymond Xie
same problem before and solved it by removing --jars. > > Cheers, > Anahita > > On Saturday, February 25, 2017, Raymond Xie <xie3208...@gmail.com> wrote: > >> I am doing a spark streaming on a hortonworks sandbox and am stuck here >> now, can anyone tell me what's wrong

No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Raymond Xie
I am running Spark Streaming on a Hortonworks sandbox and am stuck here now; can anyone tell me what's wrong with the following code, the exception it causes, and how to fix it? Thank you very much in advance. spark-submit --jars
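
The usual trap behind this error: spark-submit treats the first non-option argument as the application itself, so a bare library jar in that position triggers "No main class set in JAR". Dependency jars belong in a comma-separated --jars list and a Python app goes last. A hedged sketch (jar and script names are illustrative, not the poster's actual files):

  spark-submit \
    --jars /usr/hdp/2.5.0.0-1245/spark/lib/spark-streaming-kafka-assembly.jar \
    my_streaming_job.py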

Re: How to connect Tableau to databricks spark?

2017-01-09 Thread Raymond Xie
@granturing.com> > Date: Monday, January 9, 2017 at 2:59 PM > To: Raymond Xie <xie3208...@gmail.com>, user <user@spark.apache.org> > Subject: Re: How to connect Tableau to databricks spark? > > Hi Raymond, > > Are you using a Spark 2.0 or

How to connect Tableau to databricks spark?

2017-01-08 Thread Raymond Xie
I want to do some data analytics work by leveraging the Databricks Spark platform and connecting my Tableau desktop to it for data visualization. Has anyone ever made this work? I've been trying to follow the instructions below but have not been successful.

subscription

2017-01-08 Thread Raymond Xie
Sincerely yours, Raymond

Re: Error when loading json to spark

2017-01-01 Thread Raymond Xie
.add("minute", StringType) > val jsonContentWithSchema = sqlContext.jsonRDD(jsonRdd, schema) > > But somehow i seem to remember that there was a way , in Spark 2.0, so > that Spark will infer the schema for you.. > > hth > marco > > > > &g

Re: Error when loading json to spark

2017-01-01 Thread Raymond Xie
On Sat, Dec 31, 2016 at 11:52 PM, Miguel Morales <therevolti...@gmail.com> wrote: > Looks like it's trying to treat that path as a folder; try omitting > the file name and just use the folder path. > > On Sat, Dec 31, 2016 at 7:58 PM, Raymond Xie <xie3208..

Re: Error when loading json to spark

2017-01-01 Thread Raymond Xie
omitting > the file name and just use the folder path. > > On Sat, Dec 31, 2016 at 7:58 PM, Raymond Xie <xie3208...@gmail.com> wrote: > > Happy new year!!! > > I am trying to load a json file into spark; the json file is attached > here. > >
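
A hedged sketch of the folder-path advice (Spark 2.x syntax; path illustrative). One extra caveat worth knowing: before the multiLine option arrived in Spark 2.2, the JSON source expected one complete JSON object per line, which is another common cause of this error.

  // read.json accepts either a directory or a single file.
  val df = spark.read.json("/data/json_dir")
  df.printSchema()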

From Hive to Spark, what is the default database/table

2016-12-31 Thread Raymond Xie
Hello, it is indicated in https://spark.apache.org/docs/1.6.1/sql-programming-guide.html#dataframes that when Running SQL Queries Programmatically you can do: from pyspark.sql import SQLContext sqlContext = SQLContext(sc) df = sqlContext.sql("SELECT * FROM table") However, it did not indicate what
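
A plain SQLContext starts with no tables at all: "table" must first be registered (e.g. with registerTempTable), whereas a HiveContext also sees tables in Hive's default database. A hedged Scala sketch for Spark 1.6 (the PySpark equivalent is pyspark.sql.HiveContext; the table name is illustrative):

  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  hiveContext.sql("SHOW TABLES").show()      // tables in Hive's "default" database
  val df = hiveContext.sql("SELECT * FROM some_hive_table")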

Re: How to load a big csv to dataframe in Spark 1.6

2016-12-31 Thread Raymond Xie
xcheun...@hotmail.com> wrote: > Have you tried the spark-csv package? > > https://spark-packages.org/package/databricks/spark-csv > > ---------- > From: Raymond Xie <xie3208...@gmail.com> > Sent: Friday, December 30, 2016 6:46:11 PM > T

Re: How to load a big csv to dataframe in Spark 1.6

2016-12-30 Thread Raymond Xie
https://spark-packages.org/package/databricks/spark-csv > > ---------- > From: Raymond Xie <xie3208...@gmail.com> > Sent: Friday, December 30, 2016 6:46:11 PM > To: user@spark.apache.org > Subject: How to load a big csv to dataframe in Spark 1.6 > > Hello, > > I see there is usually thi

Re: How to load a big csv to dataframe in Spark 1.6

2016-12-30 Thread Raymond Xie
Sent from my Samsung device -------- Original message -------- From: Raymond Xie <xie3208...@gmail.com> Date: 31/12/2016 10:46 (GMT+08:00) To: user@spark.apache.org Subject: How to load a big csv to dataframe in Spark 1.6 Hello, I see there is usually this way to load a csv to dataframe: sqlC

How to load a big csv to dataframe in Spark 1.6

2016-12-30 Thread Raymond Xie
Hello, I see there is usually this way to load a csv to a dataframe: sqlContext = SQLContext(sc) Employee_rdd = sc.textFile("\..\Employee.csv").map(lambda line: line.split(",")) Employee_df = Employee_rdd.toDF(['Employee_ID','Employee_name']) Employee_df.show() However in my
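
For Spark 1.6 the usual route for a large CSV is the databricks spark-csv package rather than textFile + split (which breaks on quoted fields). A hedged Scala sketch (package version and path are illustrative):

  // Launched with e.g.: spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/data/Employee.csv")
  df.show()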

What's the best practice to load data from RDMS to Spark

2016-12-30 Thread Raymond Xie
Hello, I am new to Spark. As a SQL developer, I have only taken some courses online and spent some time studying on my own; I have never had a chance to work on a real project. I wonder what would be the best practice (tools, procedures...) for loading data (csv, excel) into the Spark platform? Thank you. Raymond
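
For data that lives in an RDBMS, the standard answer is Spark's built-in JDBC reader; a hedged sketch (Spark 2.x; URL, table, and credentials are illustrative, and the matching JDBC driver jar must be on the classpath):

  val orders = spark.read.format("jdbc")
    .option("url", "jdbc:mysql://dbhost:3306/retail_db")
    .option("dbtable", "orders")
    .option("user", "etl_user")
    .option("password", "secret")
    .load()

CSV files can go straight through spark.read.csv; Excel workbooks are usually exported to CSV first, since Spark has no built-in Excel source.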