Re: Spark Data Frame Writer - Range Partitioning

2017-07-25 Thread Jain, Nishit
But wouldn't a partitioning column only partition the data within the Spark RDD? Would 
it also partition the data on disk when it is written (dividing the data into folders)?

From: ayan guha <guha.a...@gmail.com<mailto:guha.a...@gmail.com>>
Date: Friday, July 21, 2017 at 3:25 PM
To: "Jain, Nishit" <nja...@underarmour.com<mailto:nja...@underarmour.com>>, 
"user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: Spark Data Frame Writer - Range Partitioning

How about creating a partition column and using it?
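
A minimal sketch of that idea (the bucket boundaries and output path below are hypothetical):

import org.apache.spark.sql.functions.{col, when}

// Derive a bucket column from column_a: values 1-5 map to "1to5", the rest to "6to10".
val bucketed = df.withColumn(
  "column_a_bucket",
  when(col("column_a") <= 5, "1to5").otherwise("6to10"))

bucketed.write
  .partitionBy("column_a_bucket")
  .parquet("/tmp/output")  // writes column_a_bucket=1to5/ and column_a_bucket=6to10/ folders

Queries that filter on column_a_bucket can then prune those folders the same way they 
would prune column_a folders.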

On Sat, 22 Jul 2017 at 2:47 am, Jain, Nishit 
<nja...@underarmour.com<mailto:nja...@underarmour.com>> wrote:

Is it possible to have the Spark Data Frame Writer write based on range partitioning?

For Ex -

I have 10 distinct values for column_a, say 1 to 10.

df.write
.partitionBy("column_a")


The above code will by default create 10 folders: column_a=1, column_a=2, ..., column_a=10.

I want to see if it is possible to have these partitions based on buckets instead, e.g. 
col_a=1to5, col_a=5to10, or something like that, and also have the query engine 
respect them.

Thanks,

Nishit

--
Best Regards,
Ayan Guha


Spark Data Frame Writer - Range Partitioning

2017-07-21 Thread Jain, Nishit
Is it possible to have the Spark Data Frame Writer write based on range partitioning?

For Ex -

I have 10 distinct values for column_a, say 1 to 10.

df.write
.partitionBy("column_a")


The above code will by default create 10 folders: column_a=1, column_a=2, ..., column_a=10.

I want to see if it is possible to have these partitions based on buckets instead, e.g. 
col_a=1to5, col_a=5to10, or something like that, and also have the query engine 
respect them.

Thanks,

Nishit


Re: Spark Streaming Job Stuck

2017-06-06 Thread Jain, Nishit
That helped, thanks TD! :D

From: Tathagata Das 
<tathagata.das1...@gmail.com<mailto:tathagata.das1...@gmail.com>>
Date: Tuesday, June 6, 2017 at 3:26 AM
To: "Jain, Nishit" <nja...@underarmour.com<mailto:nja...@underarmour.com>>
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: Spark Streaming Job Stuck

http://spark.apache.org/docs/latest/streaming-programming-guide.html#points-to-remember-1
Hope this helps.
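
The relevant point from that page: when running a receiver-based stream locally, use 
local[n] with n greater than the number of receivers, otherwise the single thread runs 
the receiver and nothing is left to process the batches. A minimal sketch of the change, 
applied to the session built in the code below:

val spark = SparkSession
  .builder()
  .master("local[2]")  // one thread for the receiver, at least one for processing
  .appName("SocketStream")
  .getOrCreate()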

On Mon, Jun 5, 2017 at 2:51 PM, Jain, Nishit 
<nja...@underarmour.com<mailto:nja...@underarmour.com>> wrote:

I have a very simple Spark streaming job running locally in standalone mode. There is a 
custom receiver which reads from a database and passes the data to the main job, which 
prints the total. It is not an actual use case, but I am playing around to learn. The 
problem is that the job gets stuck forever; the logic is very simple, so I don't think it 
is stuck on processing, nor should it be a memory issue. What is strange is that if I 
STOP the job, I suddenly see the output of the job execution in the logs, and the other 
backed-up jobs follow! Can someone help me understand what is going on here?

val spark = SparkSession
  .builder()
  .master("local[1]")
  .appName("SocketStream")
  .getOrCreate()

val ssc = new StreamingContext(spark.sparkContext, Seconds(5))
val lines = ssc.receiverStream(new HanaCustomReceiver())

lines.foreachRDD { x => println("==" + x.count()) }

ssc.start()
ssc.awaitTermination()


(Screenshot: https://i.stack.imgur.com/y1GGr.png)

After terminating the program, the following logs roll, showing the execution of the 
batch:

17/06/05 15:56:16 INFO JobGenerator: Stopping JobGenerator immediately
17/06/05 15:56:16 INFO RecurringTimer: Stopped timer for JobGenerator after time 1496696175000
17/06/05 15:56:16 INFO JobGenerator: Stopped JobGenerator
==100

Thanks!



Spark Streaming Job Stuck

2017-06-05 Thread Jain, Nishit
I have a very simple Spark streaming job running locally in standalone mode. There is a 
custom receiver which reads from a database and passes the data to the main job, which 
prints the total. It is not an actual use case, but I am playing around to learn. The 
problem is that the job gets stuck forever; the logic is very simple, so I don't think it 
is stuck on processing, nor should it be a memory issue. What is strange is that if I 
STOP the job, I suddenly see the output of the job execution in the logs, and the other 
backed-up jobs follow! Can someone help me understand what is going on here?

val spark = SparkSession
  .builder()
  .master("local[1]")
  .appName("SocketStream")
  .getOrCreate()

val ssc = new StreamingContext(spark.sparkContext, Seconds(5))
val lines = ssc.receiverStream(new HanaCustomReceiver())

lines.foreachRDD { x => println("==" + x.count()) }

ssc.start()
ssc.awaitTermination()


(Screenshot: https://i.stack.imgur.com/y1GGr.png)

After terminating the program, the following logs roll, showing the execution of the 
batch:

17/06/05 15:56:16 INFO JobGenerator: Stopping JobGenerator immediately
17/06/05 15:56:16 INFO RecurringTimer: Stopped timer for JobGenerator after time 1496696175000
17/06/05 15:56:16 INFO JobGenerator: Stopped JobGenerator
==100

Thanks!


conf dir missing

2017-05-19 Thread Jain, Nishit
Anyone else facing this issue?


I pulled the 0.7.0 release from http://ranger.apache.org/download.html

Built it and found that there was no conf folder:

ranger-0.7.0-admin/ews/webapp/WEB-INF/classes -> ls
META-INF db_message_bundle.properties org resourcenamemap.properties
conf.dist log4jdbc.properties ranger-plugins

As a result, when I run ranger-admin-services.sh start I get:
find: `/data/ranger/ews/webapp/WEB-INF/classes/conf/': No such file or directory
Starting Apache Ranger Admin Service
Apache Ranger Admin Service failed to start!




https://issues.apache.org/jira/browse/RANGER-1595


Re: PySpark: [Errno 8] nodename nor servname provided, or not known

2016-12-19 Thread Jain, Nishit
Found it. Somehow my host mapping was messing it up. Changing it to point to 
localhost worked:

/etc/hosts

#127.0.0.1 XX.com
127.0.0.1 localhost

From: "Jain, Nishit" <nja...@underarmour.com<mailto:nja...@underarmour.com>>
Date: Monday, December 19, 2016 at 2:54 PM
To: "user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: PySpark: [Errno 8] nodename nor servname provided, or not known

Hi,

I am using the pre-built 'spark-2.0.1-bin-hadoop2.7' and when I try to start pyspark, 
I get the following message.
Any ideas what could be wrong? I tried using python3 and setting SPARK_LOCAL_IP to 
127.0.0.1, but I get the same error.


~ -> cd /Applications/spark-2.0.1-bin-hadoop2.7/bin/
/Applications/spark-2.0.1-bin-hadoop2.7/bin -> pyspark
Python 2.7.12 (default, Oct 11 2016, 05:24:00)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/12/19 14:50:47 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
16/12/19 14:50:47 WARN Utils: Your hostname, XX.com resolves to a loopback 
address: 127.0.0.1; using XX.XX.XX.XXX instead (on interface en0)
16/12/19 14:50:47 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
Traceback (most recent call last):
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/shell.py", line 
43, in 
spark = SparkSession.builder\
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/sql/session.py", 
line 169, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/context.py", 
line 294, in getOrCreate
SparkContext(conf=conf or SparkConf())
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/context.py", 
line 115, in __init__
conf, jsc, profiler_cls)
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/context.py", 
line 174, in _do_init
self._accumulatorServer = accumulators._start_update_server()
  File 
"/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/accumulators.py", line 
259, in _start_update_server
server = AccumulatorServer(("localhost", 0), _UpdateRequestHandler)
  File 
"/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/SocketServer.py",
 line 417, in __init__
self.server_bind()
  File 
"/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/SocketServer.py",
 line 431, in server_bind
self.socket.bind(self.server_address)
  File 
"/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py",
 line 228, in meth
return getattr(self._sock,name)(*args)
socket.gaierror: [Errno 8] nodename nor servname provided, or not known

Thanks,
Nishit


PySpark: [Errno 8] nodename nor servname provided, or not known

2016-12-19 Thread Jain, Nishit
Hi,

I am using the pre-built 'spark-2.0.1-bin-hadoop2.7' and when I try to start pyspark, 
I get the following message.
Any ideas what could be wrong? I tried using python3 and setting SPARK_LOCAL_IP to 
127.0.0.1, but I get the same error.


~ -> cd /Applications/spark-2.0.1-bin-hadoop2.7/bin/
/Applications/spark-2.0.1-bin-hadoop2.7/bin -> pyspark
Python 2.7.12 (default, Oct 11 2016, 05:24:00)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/12/19 14:50:47 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
16/12/19 14:50:47 WARN Utils: Your hostname, XX.com resolves to a loopback 
address: 127.0.0.1; using XX.XX.XX.XXX instead (on interface en0)
16/12/19 14:50:47 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
Traceback (most recent call last):
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/shell.py", line 
43, in 
spark = SparkSession.builder\
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/sql/session.py", 
line 169, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/context.py", 
line 294, in getOrCreate
SparkContext(conf=conf or SparkConf())
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/context.py", 
line 115, in __init__
conf, jsc, profiler_cls)
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/context.py", 
line 174, in _do_init
self._accumulatorServer = accumulators._start_update_server()
  File 
"/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/accumulators.py", line 
259, in _start_update_server
server = AccumulatorServer(("localhost", 0), _UpdateRequestHandler)
  File 
"/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/SocketServer.py",
 line 417, in __init__
self.server_bind()
  File 
"/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/SocketServer.py",
 line 431, in server_bind
self.socket.bind(self.server_address)
  File 
"/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py",
 line 228, in meth
return getattr(self._sock,name)(*args)
socket.gaierror: [Errno 8] nodename nor servname provided, or not known

Thanks,
Nishit


Spark AVRO S3 read not working for partitioned data

2016-11-17 Thread Jain, Nishit
When I read a specific file it works:

val filePath= "s3n://bucket_name/f1/f2/avro/dt=2016-10-19/hr=19/00"
val df = spark.read.avro(filePath)


But if I point it to a folder to read date-partitioned data, it fails:

val filePath="s3n://bucket_name/f1/f2/avro/dt=2016-10-19/"

I get this error:

Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: 
org.jets3t.service.S3ServiceException: S3 HEAD request failed for 
'/f1%2Ff2%2Favro%2Fdt%3D2016-10-19' - ResponseCode=403, 
ResponseMessage=Forbidden
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleServiceException(Jets3tNativeFileSystemStore.java:245)
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:119)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at org.apache.hadoop.fs.s3native.$Proxy7.retrieveMetadata(Unknown Source)
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:414)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1397)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:374)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:364)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
at 
com.databricks.spark.avro.package$AvroDataFrameReader$$anonfun$avro$2.apply(package.scala:34)
at 
com.databricks.spark.avro.package$AvroDataFrameReader$$anonfun$avro$2.apply(package.scala:34)
at BasicS3Avro$.main(BasicS3Avro.scala:55)
at BasicS3Avro.main(BasicS3Avro.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)


Am I missing anything?
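
For comparison, an equivalent read through the s3a connector (assuming hadoop-aws and a 
matching AWS SDK are on the classpath, and that the credentials have ListBucket as well 
as GetObject permissions) would look roughly like this:

import com.databricks.spark.avro._

// Hypothetical switch from s3n to s3a; bucket and path are the ones from above.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

val df = spark.read.avro("s3a://bucket_name/f1/f2/avro/dt=2016-10-19/")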




Re: How do I convert a data frame to broadcast variable?

2016-11-04 Thread Jain, Nishit
Awesome, thanks Silvio!

From: Silvio Fiorito 
<silvio.fior...@granturing.com<mailto:silvio.fior...@granturing.com>>
Date: Thursday, November 3, 2016 at 12:26 PM
To: "Jain, Nishit" <nja...@underarmour.com<mailto:nja...@underarmour.com>>, 
Denny Lee <denny.g@gmail.com<mailto:denny.g@gmail.com>>, 
"user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: How do I convert a data frame to broadcast variable?


Hi Nishit,


Yes the JDBC connector supports predicate pushdown and column pruning. So any 
selection you make on the dataframe will get materialized in the query sent via 
JDBC.


You should be able to verify this by looking at the physical query plan:


import sqlContext.implicits._  // for the $"col" syntax

val df = sqlContext.read.jdbc(url, table, connectionProperties).select($"col1", $"col2")

df.explain(true)


Or if you can easily log queries submitted to your database then you can view 
the specific query.


Thanks,

Silvio


From: Jain, Nishit <nja...@underarmour.com<mailto:nja...@underarmour.com>>
Sent: Thursday, November 3, 2016 12:32:48 PM
To: Denny Lee; user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: How do I convert a data frame to broadcast variable?

Thanks Denny! That does help. I will give that a shot.

Question: If I go this route, I am wondering how I can read only a few columns of a 
table (not the whole table) from JDBC as a data frame.
This function from the data frame reader does not give an option to read only 
certain columns:
def jdbc(url: String, table: String, predicates: Array[String], 
connectionProperties: Properties): DataFrame

On the other hand, if I create a JdbcRDD I can specify a select query (instead of the 
full table):
new JdbcRDD(sc: SparkContext, getConnection: () ⇒ Connection, sql: String, 
lowerBound: Long, upperBound: Long, numPartitions: Int, mapRow: (ResultSet) ⇒ T 
= JdbcRDD.resultSetToObjectArray)(implicit arg0: ClassTag[T])

Maybe if I do df.select(col1, col2) on a data frame created from a table, will Spark be 
smart enough to fetch only the two columns and not the entire table?
Is there any way to test this?
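
For reference, the JDBC reader also accepts a parenthesized subquery wherever a table 
name is expected, so only the needed columns ever leave the database. A sketch with 
made-up connection details:

import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", "dbUser")          // placeholder credentials
connectionProperties.put("password", "dbPassword")

val df = spark.read.jdbc(
  "jdbc:sap://hanahost:30015",                  // hypothetical HANA JDBC URL
  "(select col1, col2 from lookup_table) t",    // subquery in place of a table name
  connectionProperties)

df.explain(true)  // the pushed-down query appears in the physical plan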


From: Denny Lee <denny.g@gmail.com<mailto:denny.g@gmail.com>>
Date: Thursday, November 3, 2016 at 10:59 AM
To: "Jain, Nishit" <nja...@underarmour.com<mailto:nja...@underarmour.com>>, 
"user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: How do I convert a data frame to broadcast variable?

If you're able to read the data in as a DataFrame, perhaps you can use a 
BroadcastHashJoin so that you can join to that table, presuming it's small enough to 
distribute? Here's a handy guide on a BroadcastHashJoin: 
https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/05%20BroadcastHashJoin%20-%20scala.html

HTH!
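
A minimal sketch of that pattern (table and key names below are hypothetical):

import org.apache.spark.sql.functions.broadcast

// factDf is the large DataFrame, lookupDf the small lookup table read over JDBC.
// broadcast() hints Spark to ship lookupDf to every executor and use a
// BroadcastHashJoin instead of shuffling both sides.
val joined = factDf.join(broadcast(lookupDf), Seq("id"))
joined.explain(true)  // the physical plan should show BroadcastHashJoin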


On Thu, Nov 3, 2016 at 8:53 AM Jain, Nishit 
<nja...@underarmour.com<mailto:nja...@underarmour.com>> wrote:
I have a lookup table in a HANA database. I want to create a Spark broadcast 
variable for it.
What would be the suggested approach? Should I read it as a data frame and 
convert the data frame into a broadcast variable?

Thanks,
Nishit


Re: How do I convert a data frame to broadcast variable?

2016-11-03 Thread Jain, Nishit
Thanks Denny! That does help. I will give that a shot.

Question: If I go this route, I am wondering how I can read only a few columns of a 
table (not the whole table) from JDBC as a data frame.
This function from the data frame reader does not give an option to read only 
certain columns:
def jdbc(url: String, table: String, predicates: Array[String], 
connectionProperties: Properties): DataFrame

On the other hand, if I create a JdbcRDD I can specify a select query (instead of the 
full table):
new JdbcRDD(sc: SparkContext, getConnection: () ⇒ Connection, sql: String, 
lowerBound: Long, upperBound: Long, numPartitions: Int, mapRow: (ResultSet) ⇒ T 
= JdbcRDD.resultSetToObjectArray)(implicit arg0: ClassTag[T])

Maybe if I do df.select(col1, col2) on a data frame created from a table, will Spark be 
smart enough to fetch only the two columns and not the entire table?
Is there any way to test this?


From: Denny Lee <denny.g@gmail.com<mailto:denny.g@gmail.com>>
Date: Thursday, November 3, 2016 at 10:59 AM
To: "Jain, Nishit" <nja...@underarmour.com<mailto:nja...@underarmour.com>>, 
"user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: How do I convert a data frame to broadcast variable?

If you're able to read the data in as a DataFrame, perhaps you can use a 
BroadcastHashJoin so that you can join to that table, presuming it's small enough to 
distribute? Here's a handy guide on a BroadcastHashJoin: 
https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/05%20BroadcastHashJoin%20-%20scala.html

HTH!


On Thu, Nov 3, 2016 at 8:53 AM Jain, Nishit 
<nja...@underarmour.com<mailto:nja...@underarmour.com>> wrote:
I have a lookup table in a HANA database. I want to create a Spark broadcast 
variable for it.
What would be the suggested approach? Should I read it as a data frame and 
convert the data frame into a broadcast variable?

Thanks,
Nishit


How do I convert a data frame to broadcast variable?

2016-11-03 Thread Jain, Nishit
I have a lookup table in a HANA database. I want to create a Spark broadcast 
variable for it.
What would be the suggested approach? Should I read it as a data frame and 
convert the data frame into a broadcast variable?

Thanks,
Nishit


Re: CSV escaping not working

2016-10-27 Thread Jain, Nishit
I’d think quoting is only necessary if you are not escaping delimiters in data. 
But we can only share our opinions. It would be good to see something 
documented.
This may be the cause of the issue?: 
https://issues.apache.org/jira/browse/CSV-135
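
A small sketch of how the two cases could be compared side by side (file names and 
contents are hypothetical; the point is only to inspect the parsed rows, not to assert 
what they will be):

// quoted.csv contains:   "north rocks\,au",AU
// unquoted.csv contains: north rocks\,au,AU
val quoted = spark.read
  .option("escape", "\\")
  .csv("quoted.csv")

val unquoted = spark.read
  .option("quote", null)   // disable quoting, as in the original snippet
  .option("escape", "\\")
  .csv("unquoted.csv")

quoted.show(false)
unquoted.show(false)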

From: Koert Kuipers <ko...@tresata.com<mailto:ko...@tresata.com>>
Date: Thursday, October 27, 2016 at 12:49 PM
To: "Jain, Nishit" <nja...@underarmour.com<mailto:nja...@underarmour.com>>
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: CSV escaping not working

well my expectation would be that if you have delimiters in your data you need 
to quote your values. if you then have quotes within your data you need to 
escape them.

so escaping is only necessary if quoted.

On Thu, Oct 27, 2016 at 1:45 PM, Jain, Nishit 
<nja...@underarmour.com<mailto:nja...@underarmour.com>> wrote:
Do you mind sharing why escaping should not work without quotes?

From: Koert Kuipers <ko...@tresata.com<mailto:ko...@tresata.com>>
Date: Thursday, October 27, 2016 at 12:40 PM
To: "Jain, Nishit" <nja...@underarmour.com<mailto:nja...@underarmour.com>>
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: CSV escaping not working

that is what i would expect: escaping only works if quoted

On Thu, Oct 27, 2016 at 1:24 PM, Jain, Nishit 
<nja...@underarmour.com<mailto:nja...@underarmour.com>> wrote:
Interesting finding: Escaping works if data is quoted but not otherwise.

From: "Jain, Nishit" <nja...@underarmour.com<mailto:nja...@underarmour.com>>
Date: Thursday, October 27, 2016 at 10:54 AM
To: "user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: CSV escaping not working


I am using spark-core version 2.0.1 with Scala 2.11. I have simple code to read 
a csv file which has \ escapes.

val myDA = spark.read
  .option("quote", null)
  .schema(mySchema)
  .csv(filePath)


As per the documentation, \ is the default escape character for the csv reader, but it 
does not work: Spark is reading \ as part of my data. For example, the City column in 
the csv file is north rocks\,au. I am expecting the city column to be read as 
north rocks,au, but instead Spark reads it as north rocks\ and moves au to the next 
column.

I have tried the following, but it did not work:

  *   Explicitly defining the escape: .option("escape", "\\")
  *   Changing the escape character to | or : in the file and in the code
  *   Using the spark-csv library

Anyone facing the same issue? Am I missing something?

Thanks




Re: CSV escaping not working

2016-10-27 Thread Jain, Nishit
Do you mind sharing why escaping should not work without quotes?

From: Koert Kuipers <ko...@tresata.com<mailto:ko...@tresata.com>>
Date: Thursday, October 27, 2016 at 12:40 PM
To: "Jain, Nishit" <nja...@underarmour.com<mailto:nja...@underarmour.com>>
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: CSV escaping not working

that is what i would expect: escaping only works if quoted

On Thu, Oct 27, 2016 at 1:24 PM, Jain, Nishit 
<nja...@underarmour.com<mailto:nja...@underarmour.com>> wrote:
Interesting finding: Escaping works if data is quoted but not otherwise.

From: "Jain, Nishit" <nja...@underarmour.com<mailto:nja...@underarmour.com>>
Date: Thursday, October 27, 2016 at 10:54 AM
To: "user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: CSV escaping not working


I am using spark-core version 2.0.1 with Scala 2.11. I have simple code to read 
a csv file which has \ escapes.

val myDA = spark.read
  .option("quote", null)
  .schema(mySchema)
  .csv(filePath)


As per the documentation, \ is the default escape character for the csv reader, but it 
does not work: Spark is reading \ as part of my data. For example, the City column in 
the csv file is north rocks\,au. I am expecting the city column to be read as 
north rocks,au, but instead Spark reads it as north rocks\ and moves au to the next 
column.

I have tried the following, but it did not work:

  *   Explicitly defining the escape: .option("escape", "\\")
  *   Changing the escape character to | or : in the file and in the code
  *   Using the spark-csv library

Anyone facing the same issue? Am I missing something?

Thanks



Re: CSV escaping not working

2016-10-27 Thread Jain, Nishit
Interesting finding: Escaping works if data is quoted but not otherwise.

From: "Jain, Nishit" <nja...@underarmour.com<mailto:nja...@underarmour.com>>
Date: Thursday, October 27, 2016 at 10:54 AM
To: "user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: CSV escaping not working


I am using spark-core version 2.0.1 with Scala 2.11. I have simple code to read 
a csv file which has \ escapes.

val myDA = spark.read
  .option("quote", null)
  .schema(mySchema)
  .csv(filePath)


As per the documentation, \ is the default escape character for the csv reader, but it 
does not work: Spark is reading \ as part of my data. For example, the City column in 
the csv file is north rocks\,au. I am expecting the city column to be read as 
north rocks,au, but instead Spark reads it as north rocks\ and moves au to the next 
column.

I have tried the following, but it did not work:

  *   Explicitly defining the escape: .option("escape", "\\")
  *   Changing the escape character to | or : in the file and in the code
  *   Using the spark-csv library

Anyone facing the same issue? Am I missing something?

Thanks


CSV escaping not working

2016-10-27 Thread Jain, Nishit
I am using spark-core version 2.0.1 with Scala 2.11. I have simple code to read 
a csv file which has \ escapes.

val myDA = spark.read
  .option("quote", null)
  .schema(mySchema)
  .csv(filePath)


As per the documentation, \ is the default escape character for the csv reader, but it 
does not work: Spark is reading \ as part of my data. For example, the City column in 
the csv file is north rocks\,au. I am expecting the city column to be read as 
north rocks,au, but instead Spark reads it as north rocks\ and moves au to the next 
column.

I have tried the following, but it did not work:

  *   Explicitly defining the escape: .option("escape", "\\")
  *   Changing the escape character to | or : in the file and in the code
  *   Using the spark-csv library

Anyone facing the same issue? Am I missing something?

Thanks