Inquiry about contributing code

2015-08-10 Thread Hyukjin Kwon
Dear Sir / Madam, I have a plan to contribute some code for passing filters to a datasource during physical planning. In more detail, I understand that when we want to build up filter operations for data sources like Parquet (actually reading and filtering HDFS blocks up front rather than filtering in memory

Fixed writer version as version1 for Parquet when writing a Parquet file.

2015-10-09 Thread Hyukjin Kwon
Hi all, While writing some Parquet files with Spark, I found it actually only writes the Parquet files with writer version1. This changes the encoding types used in the file. Is this intentionally fixed for some reason? I changed the code and tested writing this as writer version2 and it looks fine. In more

Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Hyukjin Kwon
Hi all, I am writing this email to both user-group and dev-group since this is applicable to both. I am now working on the Spark XML datasource ( https://github.com/databricks/spark-xml). This uses an InputFormat implementation which I downgraded to Hadoop 1.x for version compatibility. However, I

Re: Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Hyukjin Kwon
change, right? > > It’s not a big change to 2.x API. if you agree, I can do, but I cannot > promise the time within one or two weeks because of my daily job. > > > > > > On Dec 9, 2015, at 5:01 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote: > > Hi all, > >

Re: Timestamp datatype in dataframe + Spark 1.4.1

2015-12-29 Thread Hyukjin Kwon
5: string, COLUMN6: > string, COLUMN7: int, COLUMN8: int, COLUMN9: string, COLUMN10: int, COLUMN11: > int, COLUMN12: int, COLUMN13: string, COLUMN14: string, COLUMN15: string, > COLUMN16: string, COLUMN17: string, COLUMN18: string, COLUMN19: string, > COLUMN20: string, COLUMN21: string, COLUMN

Re: Timestamp datatype in dataframe + Spark 1.4.1

2015-12-28 Thread Hyukjin Kwon
Hi Divya, Are you using or have you tried Spark CSV datasource https://github.com/databricks/spark-csv ? Thanks! 2015-12-28 18:42 GMT+09:00 Divya Gehlot : > Hi, > I have input data set which is CSV file where I have date columns. > My output will also be CSV file and

Re: what is the difference between coalesce() and repartition()? Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-28 Thread Hyukjin Kwon
Hi Andy, This link explains the difference well. https://bzhangusc.wordpress.com/2015/08/11/repartition-vs-coalesce/ Simply put, the difference is whether it shuffles partitions or not. Actually, coalesce() with shuffling performs exactly like repartition(). On 29 Dec 2015 08:10, "Andy
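For reference, a minimal sketch of how the partition counts behave (not from the original thread; the RDD and partition numbers are arbitrary):

```scala
// Sketch only: repartition() vs coalesce() on an arbitrary 8-partition RDD.
val rdd = sc.parallelize(1 to 100, 8)

// repartition() always shuffles and can increase or decrease the partition count.
val up = rdd.repartition(16)

// coalesce() without shuffle only merges partitions (it cannot increase them) and avoids a shuffle.
val down = rdd.coalesce(2)

// coalesce() with shuffle = true behaves like repartition().
val viaShuffle = rdd.coalesce(16, shuffle = true)

println((up.partitions.length, down.partitions.length, viaShuffle.partitions.length))
// (16, 2, 16)
```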

Re: Writing empty Dataframes doesn't save any _metadata files in Spark 1.5.1 and 1.6

2016-06-14 Thread Hyukjin Kwon
and reverted. I wrote your case in the comments in that JIRA. 2016-06-15 10:26 GMT+09:00 Hyukjin Kwon <gurwls...@gmail.com>: > Yea, I met this case before. I guess this is related with > https://issues.apache.org/jira/browse/SPARK-15393. > > 2016-06-15 8:46 GMT+09:00 antoniosi <

Re: Writing empty Dataframes doesn't save any _metadata files in Spark 1.5.1 and 1.6

2016-06-14 Thread Hyukjin Kwon
Yea, I met this case before. I guess this is related with https://issues.apache.org/jira/browse/SPARK-15393. 2016-06-15 8:46 GMT+09:00 antoniosi : > I tried the following code in both Spark 1.5.1 and Spark 1.6.0: > > import org.apache.spark.sql.types.{ > StructType,

Re: how to load compressed (gzip) csv file using spark-csv

2016-06-16 Thread Hyukjin Kwon
It will 'auto-detect' the compression codec by the file extension and then will decompress and read it correctly. Thanks! 2016-06-16 20:27 GMT+09:00 Vamsi Krishna : > Hi, > > I'm using Spark 1.4.1 (HDP 2.3.2). > As per the spark-csv documentation ( >
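As a rough sketch of what that looks like with spark-csv on Spark 1.x (the path and options are assumptions, not from the thread):

```scala
// Sketch: spark-csv infers the compression codec from the ".gz" extension.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/path/to/data.csv.gz")   // hypothetical path; decompressed transparently

df.show()
```

Note that a gzipped file is not splittable, so it is read by a single task before any repartitioning.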

Re: NA value handling in sparkR

2016-01-27 Thread Hyukjin Kwon
Hm.. As far as I remember, you can set the value to treat as null with the *nullValue* option. I am hitting network issues with GitHub so I can't check this right now, but please try that option as described in https://github.com/databricks/spark-csv. 2016-01-28 0:55 GMT+09:00 Felix Cheung
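A minimal sketch of the option being suggested (the sentinel string "NA" and the path are just examples):

```scala
// Sketch: treat a sentinel string as null while reading with spark-csv.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("nullValue", "NA")    // cells equal to "NA" become null
  .load("/path/to/data.csv")    // hypothetical path
```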

Re: Reading lzo+index with spark-csv (Splittable reads)

2016-01-31 Thread Hyukjin Kwon
Hm.. As I said here https://github.com/databricks/spark-csv/issues/245#issuecomment-177682354, it sounds reasonable in a way. To me, though, this might only address some narrow use-cases. How about using csvRdd(),

Re: spark-xml data source (com.databricks.spark.xml) not working with spark 1.6

2016-02-25 Thread Hyukjin Kwon
Hi, it looks like you forgot to specify the "rowTag" option, which is "book" in the case of the sample data. Thanks 2016-01-29 8:16 GMT+09:00 Andrés Ivaldi : > Hi, could you get it work, tomorrow I'll be using the xml parser also, On > windows 7, I'll let you know the results.

Documentation for "hidden" RESTful API for submitting jobs (not history server)

2016-03-14 Thread Hyukjin Kwon
Hi all, While googling Spark, I accidentally found a RESTful API existing in Spark for submitting jobs. The link is here, http://arturmkrtchyan.com/apache-spark-hidden-rest-api As Josh said, I can see the history of this RESTful API, https://issues.apache.org/jira/browse/SPARK-5388 and also

Re: Databricks fails to read the csv file with blank line at the file header

2016-03-28 Thread Hyukjin Kwon
Could I ask which version you are using? It looks like the cause is the empty line right after the header (because that case is not being checked in tests). However, empty lines before the header or inside the data are being tested.

Re: is there any way to submit spark application from outside of spark cluster

2016-03-26 Thread Hyukjin Kwon
Hi, For RESTful API for submitting an application, please take a look at this link. http://arturmkrtchyan.com/apache-spark-hidden-rest-api On 26 Mar 2016 12:07 p.m., "vetal king" wrote: > Prateek > > It's possible to submit spark application from outside application. If

Re: Null pointer exception when using com.databricks.spark.csv

2016-03-29 Thread Hyukjin Kwon
Hi, I guess this is not a CSV-datasource-specific problem. Does loading any other file (e.g. with textFile()) work? I think this is related to this thread, http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-running-example-scala-application-using-spark-submit-td10056.html .

Re: XML Data Source for Spark

2016-04-25 Thread Hyukjin Kwon
Hi Janan, Sorry, I was sleeping. I guess you sent an email to me first and then asked the mailing list because I was not answering. I just tested this to double-check and could reproduce the same exception below: java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;

Re: removing header from csv file

2016-04-27 Thread Hyukjin Kwon
There are two ways to do so. Firstly, this way will cleanly make sure it skips the header, but of course the use of mapPartitionsWithIndex decreases performance: rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter } Secondly, you can do val header = rdd.first() val data =
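Put together as a runnable sketch (the path and variable names are illustrative):

```scala
// Sketch of both approaches for dropping a header line from an RDD.
val rdd = sc.textFile("/path/to/data.csv")   // hypothetical path

// 1) Drop the first line of the first partition only.
val withoutHeader1 = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}

// 2) Grab the first row and filter it out (compares every row against the header).
val header = rdd.first()
val withoutHeader2 = rdd.filter(_ != header)
```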

Re: Spark SQL query for List

2016-04-26 Thread Hyukjin Kwon
Could you maybe share your code? On 26 Apr 2016 9:51 p.m., "Ramkumar V" wrote: > Hi, > > I had loaded JSON file in parquet format into SparkSQL. I can't able to > read List which is inside JSON. > > Sample JSON > > { > "TOUR" : { > "CITIES" :

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
The wholeTextFiles() API uses WholeTextFileInputFormat, https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala, which returns false for isSplittable. In this case, only a single mapper appears for the entire

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
And also https://spark.apache.org/docs/1.6.0/programming-guide.html If the input is a single file, then this would not be distributed. On 26 Apr 2016 11:52 p.m., "Ted Yu" wrote: > Please take a look at: > core/src/main/scala/org/apache/spark/SparkContext.scala > >* Do `val

Re: Spark SQL query for List

2016-04-26 Thread Hyukjin Kwon
avaRDD ones = jRDD.map(new Function<Row,String>() { public String call(Row row) throws Exception { return row.getString(1); } }); *Thanks*, <https://in.linkedin.com/in/ramkumarcs31> On Tue, Apr 26, 2016 at 3:48 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote

Re: Error in spark-xml

2016-05-01 Thread Hyukjin Kwon
Hi Sourav, I think it is an issue. Spark XML assumes the element selected by rowTag is an object. Could you please open an issue at https://github.com/databricks/spark-xml/issues? Thanks! 2016-05-01 5:08 GMT+09:00 Sourav Mazumder : > Hi, > > Looks like there is a

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
EDIT: not a mapper but a task for HadoopRDD, as far as I know. I think the clearest way is just to run a job on multiple files with the API and check the number of tasks in the job. On 27 Apr 2016 12:06 a.m., "Hyukjin Kwon" <gurwls...@gmail.com> wrote: wholeTex

Re: Does spark support Apache Arrow

2016-05-19 Thread Hyukjin Kwon
FYI, there is a JIRA for this, https://issues.apache.org/jira/browse/SPARK-13534 I hope this link is helpful. Thanks! 2016-05-20 11:18 GMT+09:00 Sun Rui : > 1. I don’t think so > 2. Arrow is for in-memory columnar execution. While cache is for in-memory > columnar storage

Re: XML Processing using Spark SQL

2016-05-12 Thread Hyukjin Kwon
Hi Arunkumar, I guess your records are self-closing ones. There is an issue open here, https://github.com/databricks/spark-xml/issues/92 This is about XmlInputFormat.scala and it seems a bit tricky to handle the case, so I have left it open until now. Thanks! 2016-05-13 5:03 GMT+09:00 Arunkumar

Re: Error in spark-xml

2016-05-01 Thread Hyukjin Kwon
k_116 I tested this with the codes below val path = "path-to-file" sqlContext.read .format("xml") .option("rowTag", "bkval") .load(path) .show() ​ Thanks! 2016-05-01 15:11 GMT+09:00 Hyukjin Kwon <gurwls...@gmail.com>: > Hi Sourav,

Re: Parse Json in Spark

2016-05-08 Thread Hyukjin Kwon
I remember this JIRA, https://issues.apache.org/jira/browse/SPARK-7366. Parsing multiple lines is not supported in the JSON data source. Instead, this can be done with sc.wholeTextFiles(). I found some examples here, http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files

Re: In-Memory Only Spark Shuffle

2016-04-15 Thread Hyukjin Kwon
This reminds me of this Jira, https://issues.apache.org/jira/browse/SPARK-3376 and this PR, https://github.com/apache/spark/pull/5403. AFAIK, it is not and won't be supported. On 2 Apr 2016 4:13 a.m., "slavitch" wrote: > Hello; > > I’m working on spark with very large memory

Re: can spark-csv package accept strings instead of files?

2016-04-15 Thread Hyukjin Kwon
t know how to use it. I’m still learning Scala on my > own. Can you help me to start? > > Thanks, > Ben > > On Apr 15, 2016, at 8:02 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote: > > I hope it was not too late :). > > It is possible. > > Please check csvRdd

Re: JSON Usage

2016-04-17 Thread Hyukjin Kwon
Hi! Personally, I don't think it necessarily needs to be a Dataset for your goal. Just select your data at "s3" from the DataFrame loaded by sqlContext.read.json(). You can try printSchema() to check the nested schema and then select the data. Also, I guess (from your code) you are trying to

Re: WELCOME to user@spark.apache.org

2016-04-17 Thread Hyukjin Kwon
Hi Jinan, There are some examples for XML here, https://github.com/databricks/spark-xml/blob/master/src/test/java/com/databricks/spark/xml/JavaXmlSuite.java for test codes. Or, you can see documentation in README.md. https://github.com/databricks/spark-xml#java-api. There are other basic Java

Re: How does .jsonFile() work?

2016-04-19 Thread Hyukjin Kwon
Hi, I hope I understood correctly. This is a simplified procedure. Preconditions - The JSON file is written line by line; each line is a JSON document. - A root array is supported, e.g. [{...}, {...}, {...}] Procedure - Schema inference (if a user schema is not given) 1.

Re: Spark/Parquet

2016-04-14 Thread Hyukjin Kwon
Currently Spark uses Parquet 1.7.0 (parquet-mr). If you meant writer version2 (parquet-format), you can specify this by manually setting as below: sparkContext.hadoopConfiguration.set(ParquetOutputFormat.WRITER_VERSION, ParquetProperties.WriterVersion.PARQUET_2_0.toString) 2016-04-15 2:21
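Spelled out with the imports it needs (a sketch against parquet-mr 1.7.x class names; the DataFrame and output path are hypothetical):

```scala
// Sketch: ask Parquet to write files with the v2 writer instead of the default v1.
import org.apache.parquet.hadoop.ParquetOutputFormat
import org.apache.parquet.column.ParquetProperties

sparkContext.hadoopConfiguration.set(
  ParquetOutputFormat.WRITER_VERSION,
  ParquetProperties.WriterVersion.PARQUET_2_0.toString)

df.write.parquet("/path/to/output")   // hypothetical DataFrame and path
```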

Re: Spark sql not pushing down timestamp range queries

2016-04-14 Thread Hyukjin Kwon
Hi, String comparison itself is pushed down fine but the problem is dealing with Cast. It was pushed down before but it was reverted ( https://github.com/apache/spark/pull/8049). Several fixes were tried here, https://github.com/apache/spark/pull/11005 etc., but there were no changes to

Re: java.lang.RuntimeException: Unsupported type: vector

2016-07-25 Thread Hyukjin Kwon
I just wonder how your CSV data structure looks. If my understanding is correct, the SQL type of VectorUDT is StructType, and the CSV data source does not support ArrayType or StructType. Anyhow, it seems CSV does not support UDTs for now anyway.

Re: spark java - convert string to date

2016-07-31 Thread Hyukjin Kwon
I haven't used this by myself but I guess those functions should work. unix_timestamp() ​ See https://github.com/apache/spark/blob/480c870644595a71102be6597146d80b1c0816e4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2513-L2530 2016-07-31 22:57 GMT+09:00 Tony Lane
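For example, a rough sketch assuming a Spark 2.x session named spark (the column name and format are assumptions):

```scala
// Sketch: parse a string column into a proper timestamp with unix_timestamp().
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("2016-07-31 22:57:00").toDF("raw")
val parsed = df.withColumn(
  "ts", unix_timestamp($"raw", "yyyy-MM-dd HH:mm:ss").cast("timestamp"))

parsed.printSchema()  // ts: timestamp
```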

Re: DataFramesWriter saving DataFrames timestamp in weird format

2016-08-11 Thread Hyukjin Kwon
Do you mind if I ask which format you used to save the data? I guess you used CSV and there is a related PR open here https://github.com/apache/spark/pull/14279#issuecomment-237434591 2016-08-12 6:04 GMT+09:00 Jestin Ma : > When I load in a timestamp column and try

Re: Flattening XML in a DataFrame

2016-08-12 Thread Hyukjin Kwon
Hi Sreekanth, Assuming you are using Spark 1.x, I believe this code below: sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "emp").load("/tmp/sample.xml") .selectExpr("manager.id", "manager.name", "explode(manager.subordinates.clerk) as clerk") .selectExpr("id", "name",

Re: Large files with wholetextfile()

2016-07-12 Thread Hyukjin Kwon
Otherwise, please consider using https://github.com/databricks/spark-xml. Actually, there is a function to find the input file name, which is the input_file_name function,
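A small sketch of that function in use (assuming a Spark 2.x session; the path is hypothetical):

```scala
// Sketch: tag each record with the file it came from via input_file_name().
import org.apache.spark.sql.functions.input_file_name

val df = spark.read.text("/path/to/input/")   // hypothetical path
df.withColumn("source_file", input_file_name()).show(false)
```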

Re: Processing json document

2016-07-07 Thread Hyukjin Kwon
There is a good link for this here, http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files If there are a lot of small files, then it would work pretty okay in a distributed manner, but I am worried if it is a single large file. In this case, this would only work in

Re: Processing json document

2016-07-07 Thread Hyukjin Kwon
ns > without issues. I would need to have a look at it, but one large file does > not mean one Executor independent of the underlying format. > > On 07 Jul 2016, at 08:12, Hyukjin Kwon <gurwls...@gmail.com> wrote: > > There is a good link for this here, > http://searchdatasc

RE: Processing json document

2016-07-07 Thread Hyukjin Kwon
"lastName":"Smith" }, { "firstName":"Peter", "lastName":"Jones"} ] } On Thu, Jul 7, 2016 at 1:47 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote: The link uses wholeTextFiles() API which treats e

Re: JavaRDD text matadata(file name) findings

2017-01-31 Thread Hyukjin Kwon
Hi, Would it maybe be possible to switch to the text datasource with the input_file_name function? Thanks. On 1 Feb 2017 3:58 a.m., "Manohar753" wrote: Hi All, myspark job is reading data from a folder having different files with same structured data. the red JavaRdd

Re: Scala Developers

2017-01-25 Thread Hyukjin Kwon
Just as a subscriber to this mailing list, I don't want to receive job-recruiting emails or even have to make the effort to set up a filter for them. I don't know the policy in detail, but I feel it is inappropriate to send them where, in my experience, Spark users usually ask questions and discuss

Re: using spark-xml_2.10 to extract data from XML file

2017-02-14 Thread Hyukjin Kwon
Hi Carlo, There was a bug in lower versions of the library when accessing nested values. Otherwise, I suspect another issue with parsing malformed XML. Could you maybe open an issue in https://github.com/databricks/spark-xml/issues with your sample data? I will stick with it until it is

Re: filter rows by all columns

2017-01-16 Thread Hyukjin Kwon
Hi Shawn, Could we do this as below? For "any column is true": scala> val df = spark.range(10).selectExpr("id as a", "id / 2 as b") df: org.apache.spark.sql.DataFrame = [a: bigint, b: double] scala> df.filter(_.toSeq.exists(v => v == 1)).show() +---+---+ | a| b| +---+---+ | 1|0.5| | 2|1.0|

Re: [Spark2] Error writing "complex" type to CSV

2016-08-18 Thread Hyukjin Kwon
. It's just a dataset where > every record is a case class with only simple types as fields, strings and > dates. There's no nesting. > > That's what confuses me about how it's interpreting the schema. The schema > seems to be one complex field rather than a bunch of simple fields

Re: [Spark2] Error writing "complex" type to CSV

2016-08-18 Thread Hyukjin Kwon
Hi Efe, If my understanding is correct, writing/reading complex types is not supported because the CSV format can't represent nested types. I guess supporting them for writing in the external CSV library is rather a bug. I think it'd be great if we could write and read back CSV in

Re: [Spark2] Error writing "complex" type to CSV

2016-08-18 Thread Hyukjin Kwon
Ah, BTW, there is an issue, SPARK-16216, about printing dates and timestamps here. So please ignore the integer values for dates 2016-08-19 9:54 GMT+09:00 Hyukjin Kwon <gurwls...@gmail.com>: > Ah, sorry, I should have read this carefully. Do you mind if I ask your > codes to test?

Re: Entire XML data as one of the column in DataFrame

2016-08-21 Thread Hyukjin Kwon
I can't say this is the best way to do so but my instant thought is as below: Create two DataFrames. sc.hadoopConfiguration.set(XmlInputFormat.START_TAG_KEY, s"") sc.hadoopConfiguration.set(XmlInputFormat.END_TAG_KEY, s"") sc.hadoopConfiguration.set(XmlInputFormat.ENCODING_KEY, "UTF-8") val strXmlDf =

Re: [Spark2] Error writing "complex" type to CSV

2016-08-22 Thread Hyukjin Kwon
estion. What has changed that this is no longer > possible? The pull request said that it prints garbage. Was that some > regression in 2.0? The same code prints fine in 1.6.1. The field prints as > an array of the values of its fields. > > On Thu, Aug 18, 2016 at 5:56 PM, Hyukjin Kwon &l

Re: Best way to read XML data from RDD

2016-08-21 Thread Hyukjin Kwon
Hi Diwakar, The Spark XML library can take an RDD as the source. ``` val df = new XmlReader() .withRowTag("book") .xmlRdd(sqlContext, rdd) ``` If performance is critical, I would also recommend taking care of the creation and destruction of the parser. If the parser is not serializable, then you can do

Re: Flattening XML in a DataFrame

2016-08-16 Thread Hyukjin Kwon
> Hi Experts, > > > > Please suggest. Thanks in advance. > > > > Thanks, > > Sreekanth > > > > *From:* Sreekanth Jella [mailto:srikanth.je...@gmail.com] > *Sent:* Sunday, August 14, 2016 11:46 AM > *To:* 'Hyukjin Kwon' <gurwls...@gmail.com>

Re: Best way to read XML data from RDD

2016-08-22 Thread Hyukjin Kwon
Sent from Samsung Mobile. > > > Original message > From: Darin McBeath <ddmcbe...@yahoo.com> > Date:21/08/2016 17:44 (GMT+05:30) > To: Hyukjin Kwon <gurwls...@gmail.com>, Jörn Franke <jornfra...@gmail.com> > > Cc: Diwakar Dhanuskodi <diwa

Re: Spark 2.0 - Parquet data with fields containing periods "."

2016-08-31 Thread Hyukjin Kwon
Hi Don, I guess this should be fixed from 2.0.1. Please refer this PR. https://github.com/apache/spark/pull/14339 On 1 Sep 2016 2:48 a.m., "Don Drake" wrote: > I am in the process of migrating a set of Spark 1.6.2 ETL jobs to Spark > 2.0 and have encountered some

Re: Spark CSV skip lines

2016-09-10 Thread Hyukjin Kwon
Hi Selvam, If your report lines are commented with some character (e.g. #), you can skip these lines via the comment option [1]. If you are using Spark 1.x, then you might be able to do this by manually skipping lines from the RDD and then turning it into a DataFrame as below: I haven't tested this but I think this
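A minimal sketch of the comment option for Spark 2.x (the '#' marker and path are assumptions):

```scala
// Sketch: drop report/comment lines that start with '#' while reading CSV.
val df = spark.read
  .option("header", "true")
  .option("comment", "#")       // lines beginning with '#' are skipped
  .csv("/path/to/report.csv")   // hypothetical path
```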

Re: Reading a TSV file

2016-09-10 Thread Hyukjin Kwon
Yeap. Also, sep is preferred and has a higher precedence than delimiter. 2016-09-11 0:44 GMT+09:00 Jacek Laskowski : > Hi Muhammad, > > sep or delimiter should both work fine. > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache

Re: Spark CSV skip lines

2016-09-10 Thread Hyukjin Kwon
r(new StringReader(txt)); > | reader.readAll().map(data => Row(data(3),data(4),data(7), > data(9),data(14)))} > > The above code throws arrayoutofbounce exception for empty line and report > line. > > > On Sat, Sep 10, 2016 at 3:02 PM, Hyukjin Kwon <gurwls...@gmail.

Re: Spark CSV output

2016-09-10 Thread Hyukjin Kwon
Have you tried the quote-related options (e.g. `quote` or `quoteMode`, https://github.com/databricks/spark-csv/blob/master/README.md#features)? On 11 Sep 2016 12:22 a.m., "ayan guha" wrote: > CSV

Re: pyspark: sqlContext.read.text() does not work with a list of paths

2016-10-06 Thread Hyukjin Kwon
It seems obviously a bug. It was introduced from my PR, https://github.com/apache/spark/commit/d37c7f7f042f7943b5b684e53cf4284c601fb347 +1 for creating a JIRA and PR. If you have any problem with this, I would like to do this quickly. On 5 Oct 2016 9:12 p.m., "Laurent Legrand"

Re: take() works on RDD but .write.json() does not work in 2.0.0

2016-09-17 Thread Hyukjin Kwon
Hi Kevin, I have a few questions on this. Does it fail only with write.json()? I just wonder whether write.text, csv or another API fails as well, or whether it is a JSON-specific issue. Also, does it work with small data? I want to make sure this happens only on large data. Thanks!

How many PySpark Windows users are there?

2016-09-18 Thread Hyukjin Kwon
Hi all, We are currently testing SparkR on Windows[1] and it seems several problems are being identified from time to time. Although it seems it is not easy to automate Spark's tests in Scala on Windows, because I think we should introduce proper change detection to run only related tests rather than

Re: NumberFormatException: For input string: "0.00000"

2016-09-19 Thread Hyukjin Kwon
It seems not to be an issue in Spark. Does "CSVParser" work fine with the data without Spark? On 20 Sep 2016 2:15 a.m., "Mohamed ismail" wrote: > Hi all > > I am trying to read: > > sc.textFile(DataFile).mapPartitions(lines => { > val parser = new

Re: NumberFormatException: For input string: "0.00000"

2016-09-19 Thread Hyukjin Kwon
It seems not to be an issue in Spark. Does "CSVParser" work fine with the data without Spark? BTW, it seems there is something wrong with your email address. I am sending this again. On 20 Sep 2016 8:32 a.m., "Hyukjin Kwon" <gurwls...@gmail.com> wrote: > It seem

Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Hyukjin Kwon
Hi Mich, I guess you could use the nullValue option by setting it to null. If you are reading them in as strings in the first place, then you would hit https://github.com/apache/spark/pull/14118 first, which is resolved in 2.0.1. Unfortunately, this bug also exists in the external CSV library for

Re: spark sql on json

2016-09-29 Thread Hyukjin Kwon
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java#L104-L181 2016-09-29 18:58 GMT+09:00 Hitesh Goyal : > Hi team, > > > > I have a json document. I want to put spark SQL to it. > > Can you please

Re: spark infers date to be timestamp type

2016-10-26 Thread Hyukjin Kwon
2015-01-01 > 2016-03-05 > > next i run this code in spark 2.0.1: > spark.read > .format("csv") > .option("header", true) > .option("inferSchema", true) > .load("test.csv") > .printSchema > > the result is: > root &

Re: csv date/timestamp type inference in spark 2.0.1

2016-10-26 Thread Hyukjin Kwon
Hi Koert, I am curious about your case. I guess the purpose of timestampFormat and dateFormat is to infer timestamps/dates when parsing/inferring but not to exclude the type inference/parsing. Actually, it does try to infer/parse in 2.0.0 as well (but it fails) so actually I guess there

Re: spark infers date to be timestamp type

2016-10-26 Thread Hyukjin Kwon
There are now timestampFormat for TimestampType and dateFormat for DateType. Do you mind if I ask you to share your code? On 27 Oct 2016 2:16 a.m., "Koert Kuipers" wrote: > is there a reason a column with dates in format -mm-dd in a csv file > is inferred to be

Re: Reading csv files with quoted fields containing embedded commas

2016-11-06 Thread Hyukjin Kwon
Hi Femi, Have you maybe tried the quote related options specified in the documentation? http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv Thanks. 2016-11-06 6:58 GMT+09:00 Femi Anthony : > Hi, I am trying to process a very

Re: how to extract arraytype data to file

2016-10-18 Thread Hyukjin Kwon
This reminds me of https://github.com/databricks/spark-xml/issues/141#issuecomment-234835577 Maybe using explode() would be helpful. Thanks! 2016-10-19 14:05 GMT+09:00 Divya Gehlot : > http://stackoverflow.com/questions/33864389/how-can-i- >
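For instance, a sketch of flattening an array column with explode() before writing it out (assuming a Spark 2.x session; column names and the path are assumptions):

```scala
// Sketch: turn an array column into one row per element, then save as CSV.
import org.apache.spark.sql.functions.explode
import spark.implicits._

val df = Seq((1, Seq("a", "b", "c"))).toDF("id", "items")
df.select($"id", explode($"items").as("item"))
  .write.csv("/path/to/output")   // hypothetical path
```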

Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Hyukjin Kwon
I am also interested in this issue. I will try to look into this too within coming few days.. 2016-10-24 21:32 GMT+09:00 Sean Owen : > I actually think this is a general problem with usage of DateFormat and > SimpleDateFormat across the code, in that it relies on the default

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-18 Thread Hyukjin Kwon
Regarding his recent PR[1], I guess he meant multi-line JSON. As far as I know, single-line JSON also complies with the standard. I left a comment with the RFC in the PR but please let me know if I am wrong on any point. Thanks! [1]https://github.com/apache/spark/pull/15511 On 19 Oct 2016 7:00 a.m.,

Re: How to read a Multi Line json object via Spark

2016-11-15 Thread Hyukjin Kwon
Hi Sree, There is a blog about that, http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/ It is pretty old but I am sure it is helpful. Currently, the JSON datasource only supports JSON documents formatted according to http://jsonlines.org/ There is an

Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-16 Thread Hyukjin Kwon
It sounds like maybe you are looking for the from_json/to_json functions, after encoding/decoding properly. On 16 Nov 2016 6:45 p.m., "kant kodali" wrote: > > > https://spark.apache.org/docs/2.0.2/sql-programming-guide. > html#json-datasets > > "Spark SQL can automatically infer the
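A rough sketch of that pairing (Spark 2.1+; the schema and column names are illustrative):

```scala
// Sketch: decode a JSON string column with from_json(), re-encode it with to_json().
import org.apache.spark.sql.functions.{col, from_json, to_json}
import org.apache.spark.sql.types._
import spark.implicits._

val schema = new StructType().add("a", IntegerType).add("b", StringType)
val df = Seq("""{"a": 1, "b": "x"}""").toDF("json_blob")

val decoded = df.select(from_json(col("json_blob"), schema).as("obj"))
decoded.select(to_json(col("obj"))).show(false)
```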

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-16 Thread Hyukjin Kwon
ll check with new version and try to use different rowTags and >> increase executor-memory tomorrow. I will open a new issue as well. >> >> >> >> On Tue, Nov 15, 2016 at 7:52 PM, Hyukjin Kwon <gurwls...@gmail.com> >> wrote: >> >>> Hi Ar

Re: Handling windows characters with Spark CSV on Linux

2016-11-17 Thread Hyukjin Kwon
Actually, the CSV datasource supports an encoding option [1] (although it does not support non-ASCII-compatible encoding types). [1] https://github.com/apache/spark/blob/44c8bfda793b7655e2bd1da5e9915a09ed9d42ce/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L364 On 17 Nov 2016 10:59
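A minimal sketch of the encoding option (the charset and path here are just examples):

```scala
// Sketch: name the charset the file was written with when reading CSV.
val df = spark.read
  .option("header", "true")
  .option("encoding", "windows-1252")   // must be an ASCII-compatible charset
  .csv("/path/to/windows.csv")          // hypothetical path
```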

Re: Spark SQL shell hangs

2016-11-13 Thread Hyukjin Kwon
Hi Rakesh, Could you please open an issue in https://github.com/databricks/spark-xml with some codes so that reviewers can reproduce the issue you met? Thanks! 2016-11-14 0:20 GMT+09:00 rakesh sharma : > Hi > > I'm trying to convert an XML file to data frame using

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Hyukjin Kwon
Hi Arun, I have a few questions. Does your XML file have a few huge documents? In the case of a row having a huge size (like 500MB), it would consume a lot of memory because, if I remember correctly, it at least has to hold a whole row to iterate. I remember this happened to me before while

Re: pyspark: accept unicode column names in DataFrame.corr and cov

2016-11-12 Thread Hyukjin Kwon
Hi Sam, I think I have some answers for two questions. > Humble request: could we replace the "isinstance(col1, str)" tests with "isinstance(col1, basestring)"? IMHO, yes, I believe this should be basestring. Otherwise, some functions would not accept unicode as arguments for columns in Python

Re: Error creating SparkSession, in IntelliJ

2016-11-03 Thread Hyukjin Kwon
Hi Shyla, there is the documentation for setting up IDE - https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup I hope this is helpful. 2016-11-04 9:10 GMT+09:00 shyla deshpande : > Hello Everyone, > > I just installed

Re: Spark XML ignore namespaces

2016-11-03 Thread Hyukjin Kwon
Oh, that PR was actually about not concerning itself with the namespaces (meaning leaving the data as it is, including prefixes). The problem was, each partition needs to produce each record knowing the namespaces. It is fine to deal with them if they are within each XML document (represented as a

Re: JSON Arrays and Spark

2016-10-12 Thread Hyukjin Kwon
No, I meant it should be in a single line but it supports array type too as a root wrapper of JSON objects. If you need to parse multiple lines, I have a reference here. http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/ 2016-10-12 15:04 GMT+09:00 Kappaganthu,

Re: JSON Arrays and Spark

2016-10-10 Thread Hyukjin Kwon
FYI, it supports [{...}, {...} ...] Or {...} format as input. On 11 Oct 2016 3:19 a.m., "Jean Georges Perrin" wrote: > Thanks Luciano - I think this is my issue :( > > On Oct 10, 2016, at 2:08 PM, Luciano Resende wrote: > > Please take a look at >

Re: get corrupted rows using columnNameOfCorruptRecord

2016-12-07 Thread Hyukjin Kwon
Let me please just extend the suggestion a bit more verbosely. I think you could try something like this maybe. val jsonDF = spark.read .option("columnNameOfCorruptRecord", "xxx") .option("mode","PERMISSIVE") .schema(StructType(schema.fields :+ StructField("xxx", StringType, true)))
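Filled out as a runnable sketch (the base schema and the final filter are my assumptions about the rest of the suggestion):

```scala
// Sketch: keep malformed JSON lines in an extra column, then select them afterwards.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

val schema = new StructType().add("a", IntegerType)      // expected fields
val withCorrupt = schema.add("xxx", StringType)          // holds malformed lines

val jsonDF = spark.read
  .option("columnNameOfCorruptRecord", "xxx")
  .option("mode", "PERMISSIVE")
  .schema(withCorrupt)
  .json("/path/to/data.json")                            // hypothetical path

jsonDF.filter(col("xxx").isNotNull).show(false)          // only the corrupted rows
```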

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-12-05 Thread Hyukjin Kwon
Thanks for this but Isn't this what Michael suggested? > > Thanks, > kant > > On Mon, Dec 5, 2016 at 4:45 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote: > >> Hi Kant, >> >> How about doing something like this? >> >> import org.apache.spark.sql.functions._

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-12-05 Thread Hyukjin Kwon
Hi Kant, How about doing something like this? import org.apache.spark.sql.functions._ // val df2 = df.select(df("body").cast(StringType).as("body")) val df2 = Seq("""{"a": 1}""").toDF("body") val schema = spark.read.json(df2.as[String].rdd).schema df2.select(from_json(col("body"),
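A hedged reconstruction of the full pattern (Spark 2.1+; the part after the snippet's truncation point is my completion, not the original message):

```scala
// Sketch: infer a schema from the JSON strings themselves, then parse with from_json().
import org.apache.spark.sql.functions.{col, from_json}
import spark.implicits._

val df2 = Seq("""{"a": 1}""").toDF("body")
val schema = spark.read.json(df2.as[String].rdd).schema   // infer schema from the strings
df2.select(from_json(col("body"), schema).as("parsed")).show(false)
```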

Re: Unable to explain the job kicked off for spark.read.csv

2017-01-08 Thread Hyukjin Kwon
Oh, I mean another job would *not* happen if the schema is explicitly given. 2017-01-09 16:37 GMT+09:00 Hyukjin Kwon <gurwls...@gmail.com>: > Hi Appu, > > > I believe that textFile and filter came from... > > https://github.com/apache/spark/blob/branch-2.1/sql/ > cor

Re: Unable to explain the job kicked off for spark.read.csv

2017-01-08 Thread Hyukjin Kwon
Hi Appu, I believe that textFile and filter came from... https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L59-L61 It needs to read the first line even if using the header is disabled and schema inference

Re: Merging Parquet Files

2016-12-22 Thread Hyukjin Kwon
Hi Benjamin, As you might already know, I believe the Hadoop command does not merge column-based formats such as ORC or Parquet but just simply concatenates them. I haven't tried this myself but I remember I saw a JIRA in Parquet -

Re: Why selectExpr changes schema (to include id column)?

2017-03-27 Thread Hyukjin Kwon
ompt response. I appreciate. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Mon, Mar 27, 2017 at 2:43 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote: > I ju

Re: Why selectExpr changes schema (to include id column)?

2017-03-27 Thread Hyukjin Kwon
I just tried to build against the current master to help check - https://github.com/apache/spark/commit/3fbf0a5f9297f438bc92db11f106d4a0ae568613 It seems I can't reproduce this as below: scala> spark.range(1).printSchema root |-- id: long (nullable = false) scala>

Re: [Spark CSV]: Use Custom TextInputFormat to Prevent Exceptions

2017-03-15 Thread Hyukjin Kwon
Other options might be: - the "spark.sql.files.ignoreCorruptFiles" option - DataFrameReader.csv(csvDataset: Dataset[String]) with a custom InputFormat (this is available from Spark 2.2.0). For example, val rdd = spark.sparkContext.newAPIHadoopFile("/tmp/abcd",
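A sketch of how those pieces combine (Spark 2.2+; TextInputFormat stands in for whatever custom InputFormat tolerates the bad input, and the rest of the call is my completion of the truncated snippet):

```scala
// Sketch: read lines through a (possibly custom) InputFormat, then hand them to the CSV reader.
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val rdd = spark.sparkContext.newAPIHadoopFile(
  "/tmp/abcd",                    // path as in the snippet above
  classOf[TextInputFormat],       // swap in a custom InputFormat here
  classOf[LongWritable],
  classOf[Text])

import spark.implicits._
val lines = rdd.map { case (_, text) => text.toString }.toDS()
val df = spark.read.option("header", "true").csv(lines)   // Dataset[String] overload
```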

Re: CSV empty columns handling in Spark 2.0.2

2017-03-16 Thread Hyukjin Kwon
I think this is fixed in https://github.com/apache/spark/pull/15767 This should be fixed in 2.1.0. 2017-03-17 3:28 GMT+09:00 George Obama : > Hello, > > > > I am using spark 2.0.2 to read the CSV file with empty columns and is > hitting the issue: > > scala>val df =

Re: DataFrameWriter - Where to find list of Options applicable to particular format(datasource)

2017-03-13 Thread Hyukjin Kwon
Hi, all the options are documented in https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter It seems we don't have both options for writing. If the goal is trimming the whitespaces, I think we could do this within dataframe operations (as we talked in the

Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Hyukjin Kwon
Cool! 2017-07-13 9:43 GMT+09:00 Denny Lee : > This is amazingly awesome! :) > > On Wed, Jul 12, 2017 at 13:23 lucas.g...@gmail.com > wrote: > >> That's great! >> >> >> >> On 12 July 2017 at 12:41, Felix Cheung wrote: >>

Re: how to set the assignee in JIRA please?

2017-07-24 Thread Hyukjin Kwon
However, I see some JIRAs are assigned to someone from time to time. Were those mistakes, or would you mind if I ask when someone gets assigned? When I started to contribute to Spark a few years ago, I was confused by this and I am pretty sure some guys are still confused. I do usually say something like

Re: how to set the assignee in JIRA please?

2017-07-24 Thread Hyukjin Kwon
It should not be a big deal anyway. Thanks for the details. 2017-07-25 10:09 GMT+09:00 Marcelo Vanzin <van...@cloudera.com>: > On Mon, Jul 24, 2017 at 6:04 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote: > > However, I see some JIRAs are assigned to someone time to time. Wer
