[jira] [Updated] (SPARK-32969) Spark Submit process not exiting after session.stop()

2020-09-22 Thread El R (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

El R updated SPARK-32969:
-
Affects Version/s: (was: 3.0.1)

> Spark Submit process not exiting after session.stop()
> -
>
> Key: SPARK-32969
> URL: https://issues.apache.org/jira/browse/SPARK-32969
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.4.7
>Reporter: El R
>Priority: Critical
>
> Exactly 3 spark submit processes are hanging from the first 3 jobs that were 
> submitted to the standalone cluster using client mode. Example from the 
> client:
> {code:java}
> root 1517 0.3 4.7 8412728 1532876 ? Sl 18:49 0:38 
> /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp 
> /usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar
>  -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 
> --conf spark.master=spark://3c520b0c6d6e:7077 --conf 
> spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml
>  --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e 
> --conf spark.fileserver.port=46102 --conf 
> packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf 
> spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf 
> spark.replClassServer.port=46104 --conf 
> spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf 
> spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True --conf 
> spark.blockManager.port=46105 --conf spark.dynamicAllocation.enabled=true 
> pyspark-shell 
> root 1746 0.4 3.5 8152640 1132420 ? Sl 18:59 0:36 
> /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp 
> /usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar
>  -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 
> --conf spark.master=spark://3c520b0c6d6e:7077 --conf 
> spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml
>  --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e 
> --conf spark.fileserver.port=46102 --conf 
> packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf 
> spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf 
> spark.replClassServer.port=46104 --conf 
> spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf 
> spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True --conf 
> spark.blockManager.port=46105 --conf spark.dynamicAllocation.enabled=true 
> pyspark-shell 
> root 2239 65.3 7.8 9743456 2527236 ? Sl 19:10 91:30 
> /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp 
> /usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar
>  -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 
> --conf spark.master=spark://3c520b0c6d6e:7077 --conf 
> spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml
>  --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e 
> --conf spark.fileserver.port=46102 --conf 
> packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf 
> spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf 
> spark.replClassServer.port=46104 --conf 
> spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf 
> spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True 

[jira] [Created] (SPARK-32969) Spark Submit process not exiting after session.stop()

2020-09-22 Thread El R (Jira)
El R created SPARK-32969:


 Summary: Spark Submit process not exiting after session.stop()
 Key: SPARK-32969
 URL: https://issues.apache.org/jira/browse/SPARK-32969
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Submit
Affects Versions: 3.0.1, 2.4.7
Reporter: El R


Exactly 3 spark submit processes are hanging from the first 3 jobs that were 
submitted to the standalone cluster using client mode. Example from the client:
{code:java}
root 1517 0.3 4.7 8412728 1532876 ? Sl 18:49 0:38 
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp 
/usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar
 -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 
--conf spark.master=spark://3c520b0c6d6e:7077 --conf 
spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml
 --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e 
--conf spark.fileserver.port=46102 --conf 
packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf 
spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf 
spark.replClassServer.port=46104 --conf 
spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf 
spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True --conf 
spark.blockManager.port=46105 --conf spark.dynamicAllocation.enabled=true 
pyspark-shell 
root 1746 0.4 3.5 8152640 1132420 ? Sl 18:59 0:36 
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp 
/usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar
 -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 
--conf spark.master=spark://3c520b0c6d6e:7077 --conf 
spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml
 --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e 
--conf spark.fileserver.port=46102 --conf 
packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf 
spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf 
spark.replClassServer.port=46104 --conf 
spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf 
spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True --conf 
spark.blockManager.port=46105 --conf spark.dynamicAllocation.enabled=true 
pyspark-shell 
root 2239 65.3 7.8 9743456 2527236 ? Sl 19:10 91:30 
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp 
/usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar
 -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 
--conf spark.master=spark://3c520b0c6d6e:7077 --conf 
spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml
 --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e 
--conf spark.fileserver.port=46102 --conf 
packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf 
spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf 
spark.replClassServer.port=46104 --conf 
spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf 
spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True --conf 
spark.blockManager.port=46105 --conf spark.dynamicAllocation.enabled=true 
pyspark-shell
 
{code}
The corresponding jobs show as 'completed' in the Spark UI and, according to their 
logs, have closed their sessions and exited. No worker resources are being 
consumed by these jobs anymore, and subsequent jobs are able to receive 
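For context, here is a minimal sketch of the pattern being described (not the reporter's actual job; only the master URL is taken from the ps output above, everything else is assumed): a client-mode PySpark script that runs one action and then calls spark.stop(). The expectation is that the backing org.apache.spark.deploy.SparkSubmit JVM exits once the Python script returns, which is what this ticket reports does not always happen.
{code:python}
# Hypothetical reproduction sketch; app name and job body are assumptions,
# only the master URL comes from the ps output above.
import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = (SparkSession.builder
             .master("spark://3c520b0c6d6e:7077")  # standalone master, client mode
             .appName("hang-repro")                # assumed app name
             .getOrCreate())

    # Run any small action so the job does real work on the cluster.
    spark.range(1000).selectExpr("sum(id)").show()

    # Shut the session down; after this returns and the script exits, the
    # driver-side spark-submit JVM should terminate as well.
    spark.stop()
    sys.exit(0)
{code}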

[jira] [Updated] (SPARK-27295) Provision to provide initial values for each source node in personalised page rank - Graphx

2019-03-27 Thread Eshwar S R (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eshwar S R updated SPARK-27295:
---
Priority: Major  (was: Minor)

> Provision to provide initial values for each source node in personalised page 
> rank - Graphx
> ---
>
> Key: SPARK-27295
> URL: https://issues.apache.org/jira/browse/SPARK-27295
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.4.0
>Reporter: Eshwar S R
>Priority: Major
>
> The present implementation of the parallel personalized PageRank algorithm 
> takes only node ids as the starting nodes, and then assigns an initial value 
> of 1.0 to all of those source nodes.
> But the user might also want to specify the initial value for each source 
> node.
> I have made the small modification to the existing code needed to achieve 
> this. I thought it might help a lot more people if I shared it here, hence I 
> am raising a PR for the same.






[jira] [Created] (SPARK-27295) Provision to provide initial values for each source node in personalised page rank - Graphx

2019-03-27 Thread Eshwar S R (JIRA)
Eshwar S R created SPARK-27295:
--

 Summary: Provision to provide initial values for each source node 
in personalised page rank - Graphx
 Key: SPARK-27295
 URL: https://issues.apache.org/jira/browse/SPARK-27295
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 2.4.0
Reporter: Eshwar S R


The present implementation of the parallel personalized PageRank algorithm takes 
only node ids as the starting nodes, and then assigns an initial value of 1.0 to 
all of those source nodes.

But the user might also want to specify the initial value for each source node.

I have made the small modification to the existing code needed to achieve this. 
I thought it might help a lot more people if I shared it here, hence I am 
raising a PR for the same.






[jira] [Commented] (SPARK-19217) Offer easy cast from vector to array

2018-12-31 Thread Samik R (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731349#comment-16731349
 ] 

Samik R commented on SPARK-19217:
-

Any update on this? It still seems useful: I am trying to extract a couple of 
values from a VectorUDT-typed column.
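
For reference, the UDF workaround mentioned in the quoted description below looks roughly like this (a sketch, assuming a DataFrame `df` with a VectorUDT column named "features"; the output path is hypothetical):
{code:python}
# Sketch of the UDF workaround (option 2 in the description below).
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, DoubleType

# Convert an ML vector into a plain Python list so the column becomes array<double>.
vector_to_array = udf(lambda v: v.toArray().tolist() if v is not None else None,
                      ArrayType(DoubleType()))

df_arrays = df.withColumn("features", vector_to_array(col("features")))
df_arrays.write.mode("overwrite").orc("/tmp/features_as_arrays")  # now writable as ORC
{code}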

> Offer easy cast from vector to array
> 
>
> Key: SPARK-19217
> URL: https://issues.apache.org/jira/browse/SPARK-19217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage (edit: at least as ORC) without 
> converting the vector columns to array columns, and there doesn't appear to 
> be an easy way to make that conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
> .select(
> col('features').cast('array').alias('features')
> ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?






[jira] [Commented] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode

2018-12-03 Thread indraneel r (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708245#comment-16708245
 ] 

indraneel r commented on SPARK-26206:
-

[~kabhwan] 

Here are some of my observations:

 - The error only occurs when data arrives for the second batch. The first 
batch goes through fine. You may not see the output sometimes (not sure why), 
but you do see it once you start pumping new data into the kafka topic, and 
this is when it throws the error.

 - This same query works well with spark 2.3.0.

 

Here's what some sample data looks like:



 
{code:java}
[
 {
 "timestamp": 1541043341540,
 "cid": "333-333-333",
 "uid": "11-111-111",
 "sessionId": "11-111-111",
 "merchantId": "",
 "event": "-222-222",
 "ip": "1.1.1.1",
 "refUrl": "",
 "referrer": "",
 "section": "lorem",
 "tag": "lorem,ipsum",
 "eventType": "Random_event_1",
 "sid": "qwwewew"
 },
 {
 "timestamp": 1541043341540,
 "cid": "333-444-444",
 "uid": "11-555-111",
 "sessionId": "11-111-111",
 "merchantId": "3331",
 "event": "-222-333",
 "ip": "1.1.2.1",
 "refUrl": "",
 "referrer": "",
 "section": "ipsum",
 "tag": "lorem,ipsum2",
 "eventType": "Random_event_2",
 "sid": "xxxdfffwewe"
 }]
{code}
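For anyone trying to reproduce the second-batch failure, here is a small producer sketch that pushes a record like the ones above into the topic and so triggers the next micro-batch (assumptions: the kafka-python package is installed, a broker runs on localhost:9092, and the topic name matches the "subscribe" option in the job):
{code:python}
# Push one of the sample records above into the topic to trigger a new micro-batch.
# kafka-python and the local broker address are assumptions, not part of the report.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

event = {
    "timestamp": 1541043341540,
    "cid": "333-333-333",
    "uid": "11-111-111",
    "eventType": "Random_event_1",
}
producer.send("test_events", event)
producer.flush()
{code}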
 

 

> Spark structured streaming with kafka integration fails in update mode 
> ---
>
> Key: SPARK-26206
> URL: https://issues.apache.org/jira/browse/SPARK-26206
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Operating system : MacOS Mojave
>  spark version : 2.4.0
> spark-sql-kafka-0-10 : 2.4.0
>  kafka version 1.1.1
> scala version : 2.12.7
>Reporter: indraneel r
>Priority: Major
>
> Spark structured streaming with kafka integration fails in update mode with 
> compilation exception in code generation. 
>  Here's the code that was executed:
> {code:java}
> // code placeholder
> override def main(args: Array[String]): Unit = {
>   val spark = SparkSession
>     .builder
>     .master("local[*]")
>     .appName("SparkStreamingTest")
>     .getOrCreate()
>  
>   val kafkaParams = Map[String, String](
>    "kafka.bootstrap.servers" -> "localhost:9092",
>    "startingOffsets" -> "earliest",
>    "subscribe" -> "test_events")
>  
>   val schema = Encoders.product[UserEvent].schema
>   val query = spark.readStream.format("kafka")
>     .options(kafkaParams)
>     .load()
>     .selectExpr("CAST(value AS STRING) as message")
>     .select(from_json(col("message"), schema).as("json"))
>     .select("json.*")
>     .groupBy(window(col("event_time"), "10 minutes"))
>     .count()
>     .writeStream
>     .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
>   println(s"batch : ${batchId}")
>   batch.show(false)
>     }
>     .outputMode("update")
>     .start()
>     query.awaitTermination()
> }{code}
> It succeeds for batch 0 but fails for batch 1 with the following exception 
> when more data arrives in the stream.
> {code:java}
> 18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
>     at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
>     at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
>     at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
>     at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
>     at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
>     at 

[jira] [Commented] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode

2018-11-29 Thread indraneel r (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704277#comment-16704277
 ] 

indraneel r commented on SPARK-26206:
-

Will check on spark-shell, but I am not sure it will make any difference. I 
tried the code on Scala 2.11 as well; it's the same issue.

> Spark structured streaming with kafka integration fails in update mode 
> ---
>
> Key: SPARK-26206
> URL: https://issues.apache.org/jira/browse/SPARK-26206
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Operating system : MacOS Mojave
>  spark version : 2.4.0
> spark-sql-kafka-0-10 : 2.4.0
>  kafka version 1.1.1
> scala version : 2.12.7
>Reporter: indraneel r
>Priority: Blocker
>
> Spark structured streaming with kafka integration fails in update mode with 
> compilation exception in code generation. 
>  Here's the code that was executed:
> {code:java}
> // code placeholder
> override def main(args: Array[String]): Unit = {
>   val spark = SparkSession
>     .builder
>     .master("local[*]")
>     .appName("SparkStreamingTest")
>     .getOrCreate()
>  
>   val kafkaParams = Map[String, String](
>    "kafka.bootstrap.servers" -> "localhost:9092",
>    "startingOffsets" -> "earliest",
>    "subscribe" -> "test_events")
>  
>   val schema = Encoders.product[UserEvent].schema
>   val query = spark.readStream.format("kafka")
>     .options(kafkaParams)
>     .load()
>     .selectExpr("CAST(value AS STRING) as message")
>     .select(from_json(col("message"), schema).as("json"))
>     .select("json.*")
>     .groupBy(window(col("event_time"), "10 minutes"))
>     .count()
>     .writeStream
>     .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
>   println(s"batch : ${batchId}")
>   batch.show(false)
>     }
>     .outputMode("update")
>     .start()
>     query.awaitTermination()
> }{code}
> It succeeds for batch 0 but fails for batch 1 with the following exception 
> when more data arrives in the stream.
> {code:java}
> 18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
>     at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
>     at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
>     at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
>     at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
>     at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360)
>     at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487)
>     at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:822)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:981)
> 

[jira] [Updated] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode

2018-11-28 Thread indraneel r (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

indraneel r updated SPARK-26206:

Description: 
Spark structured streaming with kafka integration fails in update mode with 
compilation exception in code generation. 
 Here's the code that was executed:
{code:java}
// code placeholder

override def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder
    .master("local[*]")
    .appName("SparkStreamingTest")
    .getOrCreate()
 
  val kafkaParams = Map[String, String](
   "kafka.bootstrap.servers" -> "localhost:9092",
   "startingOffsets" -> "earliest",
   "subscribe" -> "test_events")
 
  val schema = Encoders.product[UserEvent].schema
  val query = spark.readStream.format("kafka")
    .options(kafkaParams)
    .load()
    .selectExpr("CAST(value AS STRING) as message")
    .select(from_json(col("message"), schema).as("json"))
    .select("json.*")
    .groupBy(window(col("event_time"), "10 minutes"))
    .count()
    .writeStream
    .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
  println(s"batch : ${batchId}")
  batch.show(false)
    }
    .outputMode("update")
    .start()

    query.awaitTermination()
}{code}
It succeeds for batch 0 but fails for batch 1 with the following exception when 
more data arrives in the stream.
{code:java}
18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, 
Column 18: A method named "putLong" is not declared in any enclosing class nor 
any supertype, nor through a static import
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, 
Column 18: A method named "putLong" is not declared in any enclosing class nor 
any supertype, nor through a static import
    at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
    at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
    at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
    at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
    at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
    at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
    at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
    at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
    at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
    at 
org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
    at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360)
    at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494)
    at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487)
    at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487)
    at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:822)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:981)
    at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:414)
    at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:406)
    at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1295)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:406)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1306)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:848)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:432)
    at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:411)
    at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:406)
    at 

[jira] [Updated] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode

2018-11-28 Thread indraneel r (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

indraneel r updated SPARK-26206:

Description: 
Spark structured streaming with kafka integration fails in update mode with 
compilation exception in code generation. 
 Here's the code that was executed:
{code:java}
// code placeholder

override def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder
    .master("local[*]")
    .appName("SparkStreamingTest")
    .getOrCreate()
 
  val kafkaParams = Map[String, String](
   "kafka.bootstrap.servers" -> "localhost:9092",
   "startingOffsets" -> "earliest",
   "subscribe" -> "test_events")
 
  val schema = Encoders.product[UserEvent].schema
  val query = spark.readStream.format("kafka")
    .options(kafkaParams)
    .load()
    .selectExpr("CAST(value AS STRING) as message")
    .select(from_json(col("message"), schema).as("json"))
    .select("json.*")
    .groupBy(window(col("event_time"), "10 minutes"))
    .count()
    .writeStream
    .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
  println(s"batch : ${batchId}")
  batch.show(false)
    }
    .outputMode("update")
    .start()

    query.awaitTermination()
}{code}
It succeeds for batch 0 but fails for batch 1 with the following exception when 
more data arrives in the stream.
{code:java}
18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, 
Column 18: A method named "putLong" is not declared in any enclosing class nor 
any supertype, nor through a static import
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, 
Column 18: A method named "putLong" is not declared in any enclosing class nor 
any supertype, nor through a static import
    at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
    at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
    at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
    at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
    at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
    at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
    at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
    at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
    at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
    at 
org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
    at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360)
    at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494)
    at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487)
    at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487)
    at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:822)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:981)
    at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:414)
    at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:406)
    at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1295)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:406)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1306)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:848)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:432)
    at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:411)
    at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:406)
    at 

[jira] [Updated] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode

2018-11-28 Thread indraneel r (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

indraneel r updated SPARK-26206:

Description: 
Spark structured streaming with kafka integration fails in update mode with 
compilation exception in code generation. 
 Here's the code that was executed:
{code:java}
// code placeholder

override def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder
    .master("local[*]")
    .appName("SparkStreamingTest")
    .getOrCreate()
 
  val kafkaParams = Map[String, String](
   "kafka.bootstrap.servers" -> "localhost:9092",
   "startingOffsets" -> "earliest",
   "subscribe" -> "test_events")
 
  val schema = Encoders.product[UserEvent].schema
  val query = spark.readStream.format("kafka")
    .options(kafkaParams)
    .load()
    .selectExpr("CAST(value AS STRING) as message")
    .select(from_json(col("message"), schema).as("json"))
    .select("json.*")
    .groupBy(window(col("event_time"), "10 minutes"))
    .count()
    .writeStream
    .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
  println(s"batch : ${batchId}")
  batch.show(false)
    }
    .outputMode("update")
    .start()

    query.awaitTermination()
}{code}
It succeeds for batch 0 but fails for batch 1 with the following exception when 
more data arrives in the stream.
{code:java}
18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, 
Column 18: A method named "putLong" is not declared in any enclosing class nor 
any supertype, nor through a static import
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, 
Column 18: A method named "putLong" is not declared in any enclosing class nor 
any supertype, nor through a static import
    at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
    at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
    at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
    at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
    at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
    at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
    at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
    at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
    at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
    at 
org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
    at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360)
    at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494)
    at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487)
    at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487)
    at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:822)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:981)
    at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:414)
    at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:406)
    at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1295)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:406)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1306)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:848)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:432)
    at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:411)
    at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:406)
    at 

[jira] [Commented] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode

2018-11-28 Thread indraneel r (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702757#comment-16702757
 ] 

indraneel r commented on SPARK-26206:
-

[~kabhwan] 
Have added the details in the description.

> Spark structured streaming with kafka integration fails in update mode 
> ---
>
> Key: SPARK-26206
> URL: https://issues.apache.org/jira/browse/SPARK-26206
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Operating system : MacOS Mojave
>  spark version : 2.4.0
> spark-sql-kafka-0-10 : 2.4.0
>  kafka version 1.1.1
> scala version : 2.12.7
>Reporter: indraneel r
>Priority: Blocker
>
> Spark structured streaming with kafka integration fails in update mode with 
> compilation exception in code generation. 
>  Here's the code that was executed:
> {code:java}
> // code placeholder
> override def main(args: Array[String]): Unit = {
>   val spark = SparkSession
>     .builder
>     .master("local[*]")
>     .appName("SparkStreamingTest")
>     .getOrCreate()
>  
>   val kafkaParams = Map[String, String](
>    "kafka.bootstrap.servers" -> "localhost:9092",
>    "startingOffsets" -> "earliest",
>    "subscribe" -> "test_events")
>  
>   val schema = Encoders.product[UserEvent].schema
>   val query = spark.readStream.format("kafka")
>     .options(kafkaParams)
>     .load()
>     .selectExpr("CAST(value AS STRING) as message")
>     .select(from_json(col("message"), schema).as("json"))
>     .select("json.*")
>     .groupBy(window(col("event_time"), "10 minutes"))
>     .count()
>     .writeStream
>     .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
>   println(s"batch : ${batchId}")
>   batch.show(false)
>     }
>     .outputMode("update")
>     .start()
>     query.awaitTermination()
> }{code}
> It succeeds for batch 0 but fails for batch 1 with the following exception 
> when more data arrives in the stream.
> {code:java}
> 18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
>     at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
>     at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
>     at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
>     at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
>     at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360)
>     at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487)
>     at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:822)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:981)
>     at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:215)

[jira] [Updated] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode

2018-11-28 Thread indraneel r (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

indraneel r updated SPARK-26206:

Environment: 
Operating system : MacOS Mojave
 spark version : 2.4.0
spark-sql-kafka-0-10 : 2.4.0
 kafka version 1.1.1

scala version : 2.12.7

  was:
Operating system : MacOS Mojave
 spark version : 2.4.0
 spark-streaming-kafka-0-10 : 2.4.0
 kafka version 1.1.1

scala version : 2.12.7


> Spark structured streaming with kafka integration fails in update mode 
> ---
>
> Key: SPARK-26206
> URL: https://issues.apache.org/jira/browse/SPARK-26206
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Operating system : MacOS Mojave
>  spark version : 2.4.0
> spark-sql-kafka-0-10 : 2.4.0
>  kafka version 1.1.1
> scala version : 2.12.7
>Reporter: indraneel r
>Priority: Blocker
>
> Spark structured streaming with kafka integration fails in update mode with 
> compilation exception in code generation. 
> Here's the code that was executed:
> {code:java}
> // code placeholder
> override def main(args: Array[String]): Unit = {
>   val spark = SparkSession
>     .builder
>     .master("local[*]")
>     .appName("SparkStreamingTest")
>     .getOrCreate()
>  
>   val kafkaParams = Map[String, String](
>    "kafka.bootstrap.servers" -> "localhost:9092",
>    "startingOffsets" -> "earliest",
>    "subscribe" -> "test_events")
>  
>   val schema = Encoders.product[UserEvent].schema
>   val query = spark.readStream.format("kafka")
>     .options(kafkaParams)
>     .load()
>     .selectExpr("CAST(value AS STRING) as message")
>     .select(from_json(col("message"), schema).as("json"))
>     .select("json.*")
>     .groupBy(window(col("event_time"), "10 minutes"))
>     .count()
>     .writeStream
>     .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
>   println(s"batch : ${batchId}")
>   batch.show(false)
>     }
>     .outputMode("update")
>     .start()
>     query.awaitTermination()
> }{code}
> It succeeds for batch 0 but fails for batch 1 with the following exception 
> when more data arrives in the stream.
> {code:java}
> 18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
>     at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
>     at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
>     at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
>     at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
>     at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360)
>     at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487)
>     at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330)
>     at 

[jira] [Updated] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode

2018-11-28 Thread indraneel r (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

indraneel r updated SPARK-26206:

Environment: 
Operating system : MacOS Mojave
 spark version : 2.4.0
 spark-streaming-kafka-0-10 : 2.4.0
 kafka version 1.1.1

scala version : 2.12.7

  was:
Operating system : MacOS Mojave
spark version : 2.4.0
spark-streaming-kafka-0-10 : 2.4.0
kafka version 1.1.1


> Spark structured streaming with kafka integration fails in update mode 
> ---
>
> Key: SPARK-26206
> URL: https://issues.apache.org/jira/browse/SPARK-26206
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Operating system : MacOS Mojave
>  spark version : 2.4.0
>  spark-streaming-kafka-0-10 : 2.4.0
>  kafka version 1.1.1
> scala version : 2.12.7
>Reporter: indraneel r
>Priority: Blocker
>
> Spark structured streaming with kafka integration fails in update mode with 
> compilation exception in code generation. 
> Here's the code that was executed:
> {code:java}
> // code placeholder
> override def main(args: Array[String]): Unit = {
>   val spark = SparkSession
>     .builder
>     .master("local[*]")
>     .appName("SparkStreamingTest")
>     .getOrCreate()
>  
>   val kafkaParams = Map[String, String](
>    "kafka.bootstrap.servers" -> "localhost:9092",
>    "startingOffsets" -> "earliest",
>    "subscribe" -> "test_events")
>  
>   val schema = Encoders.product[UserEvent].schema
>   val query = spark.readStream.format("kafka")
>     .options(kafkaParams)
>     .load()
>     .selectExpr("CAST(value AS STRING) as message")
>     .select(from_json(col("message"), schema).as("json"))
>     .select("json.*")
>     .groupBy(window(col("event_time"), "10 minutes"))
>     .count()
>     .writeStream
>     .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
>   println(s"batch : ${batchId}")
>   batch.show(false)
>     }
>     .outputMode("update")
>     .start()
>     query.awaitTermination()
> }{code}
> It succeeds for batch 0 but fails for batch 1 with the following exception 
> when more data arrives in the stream.
> {code:java}
> 18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
>     at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
>     at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
>     at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
>     at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
>     at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360)
>     at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487)
>     at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330)
>     at 

[jira] [Created] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode

2018-11-28 Thread indraneel r (JIRA)
indraneel r created SPARK-26206:
---

 Summary: Spark structured streaming with kafka integration fails 
in update mode 
 Key: SPARK-26206
 URL: https://issues.apache.org/jira/browse/SPARK-26206
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
 Environment: Operating system : MacOS Mojave
spark version : 2.4.0
spark-streaming-kafka-0-10 : 2.4.0
kafka version 1.1.1
Reporter: indraneel r


Spark structured streaming with kafka integration fails in update mode with 
compilation exception in code generation. 
Here's the code that was executed:
{code:java}
// code placeholder

override def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder
    .master("local[*]")
    .appName("SparkStreamingTest")
    .getOrCreate()
 
  val kafkaParams = Map[String, String](
   "kafka.bootstrap.servers" -> "localhost:9092",
   "startingOffsets" -> "earliest",
   "subscribe" -> "test_events")
 
  val schema = Encoders.product[UserEvent].schema
  val query = spark.readStream.format("kafka")
    .options(kafkaParams)
    .load()
    .selectExpr("CAST(value AS STRING) as message")
    .select(from_json(col("message"), schema).as("json"))
    .select("json.*")
    .groupBy(window(col("event_time"), "10 minutes"))
    .count()
    .writeStream
    .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
  println(s"batch : ${batchId}")
  batch.show(false)
    }
    .outputMode("update")
    .start()

    query.awaitTermination()
}{code}
It succeeds for batch 0 but fails for batch 1 with the following exception when 
more data arrives in the stream.
{code:java}
18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, 
Column 18: A method named "putLong" is not declared in any enclosing class nor 
any supertype, nor through a static import
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, 
Column 18: A method named "putLong" is not declared in any enclosing class nor 
any supertype, nor through a static import
    at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
    at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
    at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
    at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
    at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
    at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
    at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
    at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
    at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
    at 
org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
    at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360)
    at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494)
    at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487)
    at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487)
    at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:822)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:981)
    at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:414)
    at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:406)
    at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1295)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:406)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1306)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:848)
    at 

[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation

2018-08-18 Thread Chakradhar N R (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584732#comment-16584732
 ] 

Chakradhar N R commented on SPARK-24432:


What are the shuffle service changes that are happening? I tried searching the 
Spark developers list but could not find any references. Can you reference a few 
JIRAs? That would be helpful. In K8S-SIG-BIGDATA there is a design doc for Spark 
shuffle 2.0, but it has not yet been made public. Can that also be referenced 
here?

Thanks

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation into the Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22600) Fix 64kb limit for deeply nested expressions under wholestage codegen

2018-08-02 Thread Samik R (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566708#comment-16566708
 ] 

Samik R commented on SPARK-22600:
-

Interestingly, the exception doesn't get caught in a try-catch block. We wanted 
to suppress the exception messages for now and wrapped the specific calls in a 
try-catch block, but it doesn't seem to do anything and the messages still show 
up.
{code:java}
import org.apache.spark.storage.StorageLevel

try {
  predictionArray(x).persist(StorageLevel.MEMORY_ONLY_SER)
} catch {
  case ex: org.codehaus.janino.InternalCompilerException =>
    println("### Caught InternalCompilerException")
}
{code}
Does anyone know why?
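For illustration only (an assumption, not a confirmed explanation): if these messages are ERROR log output from Spark's codegen fallback path rather than an exception propagated to the caller, a try-catch around the persist() call would never see them. A minimal sketch of silencing that logger instead:
{code:java}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.storage.StorageLevel

// Sketch under the assumption above: raise the threshold of the codegen logger
// so the "failed to compile" ERROR lines are not printed. This only hides the
// log output; it does not address the underlying codegen failure.
Logger
  .getLogger("org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator")
  .setLevel(Level.FATAL)

predictionArray(x).persist(StorageLevel.MEMORY_ONLY_SER)
{code}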

> Fix 64kb limit for deeply nested expressions under wholestage codegen
> -
>
> Key: SPARK-22600
> URL: https://issues.apache.org/jira/browse/SPARK-22600
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> This is an extension of SPARK-22543 to fix 64kb compile error for deeply 
> nested expressions under wholestage codegen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24608) report number of iteration/progress for ML training

2018-06-20 Thread R (JIRA)
R created SPARK-24608:
-

 Summary: report number of iteration/progress for ML training
 Key: SPARK-24608
 URL: https://issues.apache.org/jira/browse/SPARK-24608
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.3.1
Reporter: R


Debugging big ML models requires careful control of resources (memory, storage, 
CPU, etc.) and visibility into progress. Current ML training reports no progress. 
It would be ideal to be more verbose during training, for example by reporting the 
number of iterations completed.
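As a partial, post-hoc workaround only (not the live progress reporting requested here), some estimators already expose a training summary after fit; a minimal sketch with LogisticRegression, assuming a DataFrame named training with "label" and "features" columns:
{code:java}
import org.apache.spark.ml.classification.LogisticRegression

// Post-hoc sketch: after fit() completes, the training summary reports how many
// iterations ran and the objective value per iteration. "training" is an assumed
// DataFrame; this is not the live progress reporting this ticket asks for.
val lr = new LogisticRegression().setMaxIter(100)
val model = lr.fit(training)
println(s"iterations run: ${model.summary.totalIterations}")
model.summary.objectiveHistory.zipWithIndex.foreach { case (loss, i) =>
  println(s"iteration $i: objective = $loss")
}
{code}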



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23510) Support read data from Hive 2.2 and Hive 2.3 metastore

2018-03-10 Thread R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394117#comment-16394117
 ] 

R commented on SPARK-23510:
---

[~q79969786] - can you add a fix version of 2.3.1 to this? I would like this in 
the next Spark release.

> Support read data from Hive 2.2 and Hive 2.3 metastore
> --
>
> Key: SPARK-23510
> URL: https://issues.apache.org/jira/browse/SPARK-23510
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23576) SparkSQL - Decimal data missing decimal point

2018-03-02 Thread R (JIRA)
R created SPARK-23576:
-

 Summary: SparkSQL - Decimal data missing decimal point
 Key: SPARK-23576
 URL: https://issues.apache.org/jira/browse/SPARK-23576
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
 Environment: spark 2.3.0

linux
Reporter: R


Integers like 3 stored as a decimal display in SparkSQL as 300 with no decimal 
point, but Hive displays them fine as 3.

Repro steps (a code sketch of these steps follows below):
 # Create a .csv with the value 3
 # Use Spark to read the csv, cast it as decimal(31,8) and output to an ORC file
 # Use Spark to read the ORC, infer the schema (it will infer 38,18 precision) 
and output to a Parquet file
 # Create an external Hive table to read the Parquet (define the Hive type as 
decimal(31,8))
 # Use spark-sql to select from the external Hive table
 # Notice how SparkSQL shows 300 !!!
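For reference, a minimal Scala sketch of the Spark side of these steps (the paths and the table name decimal_repro are placeholders, not from the original report):
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Step 2: read the single-value CSV, cast it to decimal(31,8) and write ORC.
spark.read.csv("/tmp/decimal_repro.csv")
  .select(col("_c0").cast("decimal(31,8)").as("value"))
  .write.orc("/tmp/decimal_repro_orc")

// Step 3: read the ORC back (letting Spark infer the schema) and write Parquet.
spark.read.orc("/tmp/decimal_repro_orc")
  .write.parquet("/tmp/decimal_repro_parquet")

// Steps 4-5: point an external Hive table declared as decimal(31,8) at the
// Parquet output, then select from it and compare with Hive's own output.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS decimal_repro (value DECIMAL(31,8))
  STORED AS PARQUET LOCATION '/tmp/decimal_repro_parquet'
""")
spark.sql("SELECT value FROM decimal_repro").show(false)
{code}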

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22036) BigDecimal multiplication sometimes returns null

2017-12-14 Thread Anvesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291805#comment-16291805
 ] 

Anvesh R commented on SPARK-22036:
--

+1 Issue reproduced on spark-2.2.0:

Data at s3 location - s3://bucket/spark-sql-jira/ :
-
100|9

drop table if exists test;
CREATE EXTERNAL TABLE `test` (
a decimal(38,10),
b decimal(38,10)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://bucket/spark-sql-jira/';

spark-sql> select a,(a*b*0.98765432100) from test;
100 9876444.4445679
Time taken: 11.033 seconds, Fetched 1 row(s)

spark-sql> select a,(a*b*0.987654321000) from test;
100 NULL
Time taken: 0.523 seconds, Fetched 1 row(s)

Changing a column's scale from decimal(38,10) to decimal(38,9) also helped, but 
we would lose precision.

> BigDecimal multiplication sometimes returns null
> 
>
> Key: SPARK-22036
> URL: https://issues.apache.org/jira/browse/SPARK-22036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Olivier Blanvillain
>
> The multiplication of two BigDecimal numbers sometimes returns null. Here is 
> a minimal reproduction:
> {code:java}
> object Main extends App {
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.sql.SparkSession
>   import spark.implicits._
>   val conf = new 
> SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", 
> "false")
>   val spark = 
> SparkSession.builder().config(conf).appName("REPL").getOrCreate()
>   implicit val sqlContext = spark.sqlContext
>   case class X2(a: BigDecimal, b: BigDecimal)
>   val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), 
> BigDecimal(-1000.1))))
>   val result = ds.select(ds("a") * ds("b")).collect.head
>   println(result) // [null]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-20 Thread Kumaresh C R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-22516:
-
Attachment: test_file_without_eof_char.csv

> CSV Read breaks: When "multiLine" = "true", if "comment" option is set as 
> last line's first character
> -
>
> Key: SPARK-22516
> URL: https://issues.apache.org/jira/browse/SPARK-22516
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>Priority: Minor
>  Labels: csvparser
> Attachments: testCommentChar.csv, test_file_without_eof_char.csv
>
>
> Try to read attached CSV file with following parse properties,
> scala> *val csvFile = 
> spark.read.option("header","true").option("inferSchema", 
> "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/test
> CommentChar.csv");   *
>   
>   
> csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]  
>   
>  
>   
>   
>  
> scala> csvFile.show   
>   
>  
> +---+---+ 
>   
>  
> |  a|  b| 
>   
>  
> +---+---+ 
>   
>  
> +---+---+   
> {color:#8eb021}*Noticed that it works fine.*{color}
> If we add an option "multiLine" = "true", it fails with below exception. This 
> happens only if we pass "comment" == input dataset's last line's first 
> character
> scala> val csvFile = 
> *spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema",
>  "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> 17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
> com.univocity.parsers.common.TextParsingException: 
> java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End 
> of input reached
> Parser Configuration: CsvParserSettings:
> Auto configuration enabled=true
> Autodetect column delimiter=false
> Autodetect quotes=false
> Column reordering enabled=true
> Empty value=null
> Escape unquoted values=false
> Header extraction enabled=null
> Headers=null
> Ignore leading whitespaces=false
> Ignore trailing whitespaces=false
> Input buffer size=128
> Input reading on separate thread=false
> Keep escape sequences=false
> Keep quotes=false
> Length of content displayed on error=-1
> Line separator detection enabled=false
> Maximum number of characters per column=-1
> Maximum number of columns=20480
> Normalize escaped line separators=true
> Null value=
> Number of records to read=all
> Processor=none
> Restricting data in exceptions=false
> RowProcessor error handler=null
> Selected fields=none
> Skip empty lines=true
> Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
> CsvFormat:
> Comment character=c
> Field delimiter=,
> Line separator (normalized)=\n
> Line separator sequence=\r\n
> Quote character="
> Quote escape character=\
> Quote escape escape character=null
> Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
> at 

[jira] [Commented] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-20 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259157#comment-16259157
 ] 

Kumaresh C R commented on SPARK-22516:
--

[~mgaido]: Even after I replaced all 'CR LF' with 'LF', the error is still thrown 
in the case below:

 -> when the file doesn't have 'LF' as the last character of its last line, 
i.e. at EOF
 (Note: all other lines in the file end with an LF character)

I have attached the failing file 'test_file_without_eof_char.csv' for your reference.

Is the problem with the parser, or with the input data (which doesn't have any 
line ending as its last character)?
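For illustration only, a minimal sketch of the failing case described above (the file contents and the local path are placeholders, not the attached file): the last line starts with the comment character, has no trailing LF, and the read uses multiLine=true.
{code:java}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

import org.apache.spark.sql.SparkSession

// Placeholder file: every line ends with LF except the last one, which starts
// with the comment character 'c' and has no line ending at all.
val content = "a,b\n1,2\ncommented last line with no trailing newline"
Files.write(Paths.get("/tmp/test_file_without_eof_char.csv"),
  content.getBytes(StandardCharsets.UTF_8))

val spark = SparkSession.builder().master("local[*]").getOrCreate()
// In the report above, the TextParsingException is thrown while this read (and
// the schema inference it triggers) runs.
val csvFile = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true")
  .option("comment", "c")
  .csv("/tmp/test_file_without_eof_char.csv")
csvFile.show()
{code}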

> CSV Read breaks: When "multiLine" = "true", if "comment" option is set as 
> last line's first character
> -
>
> Key: SPARK-22516
> URL: https://issues.apache.org/jira/browse/SPARK-22516
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>Priority: Minor
>  Labels: csvparser
> Attachments: testCommentChar.csv
>
>
> Try to read attached CSV file with following parse properties,
> scala> *val csvFile = 
> spark.read.option("header","true").option("inferSchema", 
> "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/test
> CommentChar.csv");   *
>   
>   
> csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]  
>   
>  
>   
>   
>  
> scala> csvFile.show   
>   
>  
> +---+---+ 
>   
>  
> |  a|  b| 
>   
>  
> +---+---+ 
>   
>  
> +---+---+   
> {color:#8eb021}*Noticed that it works fine.*{color}
> If we add an option "multiLine" = "true", it fails with below exception. This 
> happens only if we pass "comment" == input dataset's last line's first 
> character
> scala> val csvFile = 
> *spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema",
>  "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> 17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
> com.univocity.parsers.common.TextParsingException: 
> java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End 
> of input reached
> Parser Configuration: CsvParserSettings:
> Auto configuration enabled=true
> Autodetect column delimiter=false
> Autodetect quotes=false
> Column reordering enabled=true
> Empty value=null
> Escape unquoted values=false
> Header extraction enabled=null
> Headers=null
> Ignore leading whitespaces=false
> Ignore trailing whitespaces=false
> Input buffer size=128
> Input reading on separate thread=false
> Keep escape sequences=false
> Keep quotes=false
> Length of content displayed on error=-1
> Line separator detection enabled=false
> Maximum number of characters per column=-1
> Maximum number of columns=20480
> Normalize escaped line separators=true
> Null value=
> Number of records to read=all
> Processor=none
> Restricting data in exceptions=false
> RowProcessor error handler=null
> Selected fields=none
> Skip empty lines=true
> Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
> CsvFormat:
> Comment character=c
> Field delimiter=,
> Line separator (normalized)=\n
> Line separator sequence=\r\n
> Quote character="
> Quote escape character=\
> Quote escape escape character=null
> 

[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-14 Thread Kumaresh C R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-22516:
-
Labels: csvparser  (was: )

> CSV Read breaks: When "multiLine" = "true", if "comment" option is set as 
> last line's first character
> -
>
> Key: SPARK-22516
> URL: https://issues.apache.org/jira/browse/SPARK-22516
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: csvparser
> Attachments: testCommentChar.csv
>
>
> Try to read attached CSV file with following parse properties,
> scala> *val csvFile = 
> spark.read.option("header","true").option("inferSchema", 
> "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/test
> CommentChar.csv");   *
>   
>   
> csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]  
>   
>  
>   
>   
>  
> scala> csvFile.show   
>   
>  
> +---+---+ 
>   
>  
> |  a|  b| 
>   
>  
> +---+---+ 
>   
>  
> +---+---+   
> {color:#8eb021}*Noticed that it works fine.*{color}
> If we add an option "multiLine" = "true", it fails with below exception. This 
> happens only if we pass "comment" == input dataset's last line's first 
> character
> scala> val csvFile = 
> *spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema",
>  "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> 17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
> com.univocity.parsers.common.TextParsingException: 
> java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End 
> of input reached
> Parser Configuration: CsvParserSettings:
> Auto configuration enabled=true
> Autodetect column delimiter=false
> Autodetect quotes=false
> Column reordering enabled=true
> Empty value=null
> Escape unquoted values=false
> Header extraction enabled=null
> Headers=null
> Ignore leading whitespaces=false
> Ignore trailing whitespaces=false
> Input buffer size=128
> Input reading on separate thread=false
> Keep escape sequences=false
> Keep quotes=false
> Length of content displayed on error=-1
> Line separator detection enabled=false
> Maximum number of characters per column=-1
> Maximum number of columns=20480
> Normalize escaped line separators=true
> Null value=
> Number of records to read=all
> Processor=none
> Restricting data in exceptions=false
> RowProcessor error handler=null
> Selected fields=none
> Skip empty lines=true
> Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
> CsvFormat:
> Comment character=c
> Field delimiter=,
> Line separator (normalized)=\n
> Line separator sequence=\r\n
> Quote character="
> Quote escape character=\
> Quote escape escape character=null
> Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
> at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
> at 

[jira] [Commented] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-14 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251110#comment-16251110
 ] 

Kumaresh C R commented on SPARK-22516:
--

[~hyukjin.kwon]: Need your help here :)

> CSV Read breaks: When "multiLine" = "true", if "comment" option is set as 
> last line's first character
> -
>
> Key: SPARK-22516
> URL: https://issues.apache.org/jira/browse/SPARK-22516
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
> Attachments: testCommentChar.csv
>
>
> Try to read attached CSV file with following parse properties,
> scala> *val csvFile = 
> spark.read.option("header","true").option("inferSchema", 
> "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/test
> CommentChar.csv");   *
>   
>   
> csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]  
>   
>  
>   
>   
>  
> scala> csvFile.show   
>   
>  
> +---+---+ 
>   
>  
> |  a|  b| 
>   
>  
> +---+---+ 
>   
>  
> +---+---+   
> {color:#8eb021}*Noticed that it works fine.*{color}
> If we add an option "multiLine" = "true", it fails with below exception. This 
> happens only if we pass "comment" == input dataset's last line's first 
> character
> scala> val csvFile = 
> *spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema",
>  "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> 17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
> com.univocity.parsers.common.TextParsingException: 
> java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End 
> of input reached
> Parser Configuration: CsvParserSettings:
> Auto configuration enabled=true
> Autodetect column delimiter=false
> Autodetect quotes=false
> Column reordering enabled=true
> Empty value=null
> Escape unquoted values=false
> Header extraction enabled=null
> Headers=null
> Ignore leading whitespaces=false
> Ignore trailing whitespaces=false
> Input buffer size=128
> Input reading on separate thread=false
> Keep escape sequences=false
> Keep quotes=false
> Length of content displayed on error=-1
> Line separator detection enabled=false
> Maximum number of characters per column=-1
> Maximum number of columns=20480
> Normalize escaped line separators=true
> Null value=
> Number of records to read=all
> Processor=none
> Restricting data in exceptions=false
> RowProcessor error handler=null
> Selected fields=none
> Skip empty lines=true
> Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
> CsvFormat:
> Comment character=c
> Field delimiter=,
> Line separator (normalized)=\n
> Line separator sequence=\r\n
> Quote character="
> Quote escape character=\
> Quote escape escape character=null
> Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
> at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
> at 

[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-14 Thread Kumaresh C R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-22516:
-
Attachment: testCommentChar.csv

> CSV Read breaks: When "multiLine" = "true", if "comment" option is set as 
> last line's first character
> -
>
> Key: SPARK-22516
> URL: https://issues.apache.org/jira/browse/SPARK-22516
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
> Attachments: testCommentChar.csv
>
>
> Try to read attached CSV file with following parse properties,
> scala> *val csvFile = 
> spark.read.option("header","true").option("inferSchema", 
> "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/text
> CommentChar.csv");   *
>   
>   
> csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]  
>   
>  
>   
>   
>  
> scala> csvFile.show   
>   
>  
> +---+---+ 
>   
>  
> |  a|  b| 
>   
>  
> +---+---+ 
>   
>  
> +---+---+   
> {color:#8eb021}*Noticed that it works fine.*{color}
> If we add an option "multiLine" = "true", it fails with below exception. This 
> happens only if we pass "comment" == input dataset's last line's first 
> character
> scala> val csvFile = 
> *spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema",
>  "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/textCommentChar.csv");*
> 17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
> com.univocity.parsers.common.TextParsingException: 
> java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End 
> of input reached
> Parser Configuration: CsvParserSettings:
> Auto configuration enabled=true
> Autodetect column delimiter=false
> Autodetect quotes=false
> Column reordering enabled=true
> Empty value=null
> Escape unquoted values=false
> Header extraction enabled=null
> Headers=null
> Ignore leading whitespaces=false
> Ignore trailing whitespaces=false
> Input buffer size=128
> Input reading on separate thread=false
> Keep escape sequences=false
> Keep quotes=false
> Length of content displayed on error=-1
> Line separator detection enabled=false
> Maximum number of characters per column=-1
> Maximum number of columns=20480
> Normalize escaped line separators=true
> Null value=
> Number of records to read=all
> Processor=none
> Restricting data in exceptions=false
> RowProcessor error handler=null
> Selected fields=none
> Skip empty lines=true
> Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
> CsvFormat:
> Comment character=c
> Field delimiter=,
> Line separator (normalized)=\n
> Line separator sequence=\r\n
> Quote character="
> Quote escape character=\
> Quote escape escape character=null
> Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
> at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
> at 

[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-14 Thread Kumaresh C R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-22516:
-
Description: 
Try to read attached CSV file with following parse properties,

scala> *val csvFile = spark.read.option("header","true").option("inferSchema", 
"true").option("parserLib", "univocity").option("comment", 
"c").csv("hdfs://localhost:8020/test
CommentChar.csv");   *  

  
csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]

 


 
scala> csvFile.show 

 
+---+---+   

 
|  a|  b|   

 
+---+---+   

 
+---+---+   

{color:#8eb021}*Noticed that it works fine.*{color}

If we add an option "multiLine" = "true", it fails with below exception. This 
happens only if we pass "comment" == input dataset's last line's first character

scala> val csvFile = 
*spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema",
 "true").option("parserLib", "univocity").option("comment", 
"c").csv("hdfs://localhost:8020/testCommentChar.csv");*
17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
com.univocity.parsers.common.TextParsingException: 
java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End of 
input reached
Parser Configuration: CsvParserSettings:
Auto configuration enabled=true
Autodetect column delimiter=false
Autodetect quotes=false
Column reordering enabled=true
Empty value=null
Escape unquoted values=false
Header extraction enabled=null
Headers=null
Ignore leading whitespaces=false
Ignore trailing whitespaces=false
Input buffer size=128
Input reading on separate thread=false
Keep escape sequences=false
Keep quotes=false
Length of content displayed on error=-1
Line separator detection enabled=false
Maximum number of characters per column=-1
Maximum number of columns=20480
Normalize escaped line separators=true
Null value=
Number of records to read=all
Processor=none
Restricting data in exceptions=false
RowProcessor error handler=null
Selected fields=none
Skip empty lines=true
Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
CsvFormat:
Comment character=c
Field delimiter=,
Line separator (normalized)=\n
Line separator sequence=\r\n
Quote character="
Quote escape character=\
Quote escape escape character=null
Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
at 
com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
at 
com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
at 
org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:393)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at 

[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-14 Thread Kumaresh C R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-22516:
-
Description: 
Try to read attached CSV file with following parse properties,

scala> *val csvFile = spark.read.option("header","true").option("inferSchema", 
"true").option("parserLib", "univocity").option("comment", 
"c").csv("hdfs://localhost:8020/text
CommentChar.csv");   *  

  
csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]

 


 
scala> csvFile.show 

 
+---+---+   

 
|  a|  b|   

 
+---+---+   

 
+---+---+   

{color:#8eb021}*Notice that it works fine.*{color}

If we add an option "multiLine" = "true", it fails with below exception,

scala> val csvFile = 
*spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema",
 "true").option("parserLib", "univocity").option("comment", 
"c").csv("hdfs://localhost:8020/textCommentChar.csv");*
17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
com.univocity.parsers.common.TextParsingException: 
java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End of 
input reached
Parser Configuration: CsvParserSettings:
Auto configuration enabled=true
Autodetect column delimiter=false
Autodetect quotes=false
Column reordering enabled=true
Empty value=null
Escape unquoted values=false
Header extraction enabled=null
Headers=null
Ignore leading whitespaces=false
Ignore trailing whitespaces=false
Input buffer size=128
Input reading on separate thread=false
Keep escape sequences=false
Keep quotes=false
Length of content displayed on error=-1
Line separator detection enabled=false
Maximum number of characters per column=-1
Maximum number of columns=20480
Normalize escaped line separators=true
Null value=
Number of records to read=all
Processor=none
Restricting data in exceptions=false
RowProcessor error handler=null
Selected fields=none
Skip empty lines=true
Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
CsvFormat:
Comment character=c
Field delimiter=,
Line separator (normalized)=\n
Line separator sequence=\r\n
Quote character="
Quote escape character=\
Quote escape escape character=null
Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
at 
com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
at 
com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
at 
org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:393)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at 

[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-14 Thread Kumaresh C R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-22516:
-
Description: 
Try to read attached CSV file with following parse properties,

scala> *val csvFile = spark.read.option("header","true").option("inferSchema", 
"true").option("parserLib", "univocity").option("comment", 
"c").csv("hdfs://localhost:8020/text
CommentChar.csv");   *  

  
csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]

 


 
scala> csvFile.show 

 
+---+---+   

 
|  a|  b|   

 
+---+---+   

 
+---+---+   

{color:#8eb021}*Noticed that it works fine.*{color}

If we add an option "multiLine" = "true", it fails with below exception. This 
happens only if we pass "comment" == input dataset's last line's first character

scala> val csvFile = 
*spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema",
 "true").option("parserLib", "univocity").option("comment", 
"c").csv("hdfs://localhost:8020/textCommentChar.csv");*
17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
com.univocity.parsers.common.TextParsingException: 
java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End of 
input reached
Parser Configuration: CsvParserSettings:
Auto configuration enabled=true
Autodetect column delimiter=false
Autodetect quotes=false
Column reordering enabled=true
Empty value=null
Escape unquoted values=false
Header extraction enabled=null
Headers=null
Ignore leading whitespaces=false
Ignore trailing whitespaces=false
Input buffer size=128
Input reading on separate thread=false
Keep escape sequences=false
Keep quotes=false
Length of content displayed on error=-1
Line separator detection enabled=false
Maximum number of characters per column=-1
Maximum number of columns=20480
Normalize escaped line separators=true
Null value=
Number of records to read=all
Processor=none
Restricting data in exceptions=false
RowProcessor error handler=null
Selected fields=none
Skip empty lines=true
Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
CsvFormat:
Comment character=c
Field delimiter=,
Line separator (normalized)=\n
Line separator sequence=\r\n
Quote character="
Quote escape character=\
Quote escape escape character=null
Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
at 
com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
at 
com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
at 
org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:393)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at 

[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-14 Thread Kumaresh C R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-22516:
-
Description: 
Try to read attached CSV file with following parse properties,

scala> *val csvFile = spark.read.option("header","true").option("inferSchema", 
"true").option("parserLib", "univocity").option("comment", 
"c").csv("hdfs://localhost:8020/text
CommentChar.csv");   *  

  
csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]

 


 
scala> csvFile.show 

 
+---+---+   

 
|  a|  b|   

 
+---+---+   

 
+---+---+   

{color:#8eb021}*Noticed that it works fine.*{color}

If we add an option "multiLine" = "true", it fails with below exception,

scala> val csvFile = 
*spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema",
 "true").option("parserLib", "univocity").option("comment", 
"c").csv("hdfs://localhost:8020/textCommentChar.csv");*
17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
com.univocity.parsers.common.TextParsingException: 
java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End of 
input reached
Parser Configuration: CsvParserSettings:
Auto configuration enabled=true
Autodetect column delimiter=false
Autodetect quotes=false
Column reordering enabled=true
Empty value=null
Escape unquoted values=false
Header extraction enabled=null
Headers=null
Ignore leading whitespaces=false
Ignore trailing whitespaces=false
Input buffer size=128
Input reading on separate thread=false
Keep escape sequences=false
Keep quotes=false
Length of content displayed on error=-1
Line separator detection enabled=false
Maximum number of characters per column=-1
Maximum number of columns=20480
Normalize escaped line separators=true
Null value=
Number of records to read=all
Processor=none
Restricting data in exceptions=false
RowProcessor error handler=null
Selected fields=none
Skip empty lines=true
Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
CsvFormat:
Comment character=c
Field delimiter=,
Line separator (normalized)=\n
Line separator sequence=\r\n
Quote character="
Quote escape character=\
Quote escape escape character=null
Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
at 
com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
at 
com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
at 
org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:393)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at 

[jira] [Created] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-14 Thread Kumaresh C R (JIRA)
Kumaresh C R created SPARK-22516:


 Summary: CSV Read breaks: When "multiLine" = "true", if "comment" 
option is set as last line's first character
 Key: SPARK-22516
 URL: https://issues.apache.org/jira/browse/SPARK-22516
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Kumaresh C R


Try to read attached CSV file with following parse properties,

scala> *val csvFile = spark.read.option("header","true").option("inferSchema", 
"true").option("parserLib", "univocity").option("comment", 
"c").csv("hdfs://localhost:8020/text
CommentChar.csv");   *  

  
csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]

 


 
scala> csvFile.show 

 
+---+---+   

 
|  a|  b|   

 
+---+---+   

 
+---+---+   

Notice that it works fine.

If we add an option "multiLine" = "true", it fails with below exception,

scala> val csvFile = 
*spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema",
 "true").option("parserLib", "univocity").option("comment", 
"c").csv("hdfs://localhost:8020/textCommentChar.csv");*
17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
com.univocity.parsers.common.TextParsingException: 
java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End of 
input reached
Parser Configuration: CsvParserSettings:
Auto configuration enabled=true
Autodetect column delimiter=false
Autodetect quotes=false
Column reordering enabled=true
Empty value=null
Escape unquoted values=false
Header extraction enabled=null
Headers=null
Ignore leading whitespaces=false
Ignore trailing whitespaces=false
Input buffer size=128
Input reading on separate thread=false
Keep escape sequences=false
Keep quotes=false
Length of content displayed on error=-1
Line separator detection enabled=false
Maximum number of characters per column=-1
Maximum number of columns=20480
Normalize escaped line separators=true
Null value=
Number of records to read=all
Processor=none
Restricting data in exceptions=false
RowProcessor error handler=null
Selected fields=none
Skip empty lines=true
Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
CsvFormat:
Comment character=c
Field delimiter=,
Line separator (normalized)=\n
Line separator sequence=\r\n
Quote character="
Quote escape character=\
Quote escape escape character=null
Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
at 
com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
at 
com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
at 
org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:393)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at 

[jira] [Comment Edited] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138405#comment-16138405
 ] 

Kumaresh C R edited comment on SPARK-21820 at 8/23/17 2:24 PM:
---

[~hyukjin.kwon]: Sounds great.. We will wait for your proposal 
https://github.com/apache/spark/pull/18581 to be merged. 
Thanks a lot :)


was (Author: crkumaresh24):
[~hyukjin.kwon]: Sounds great.. We will wait for your proposal 
https://github.com/apache/spark/pull/18581to be merged. 
Thanks a lot :)

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138405#comment-16138405
 ] 

Kumaresh C R commented on SPARK-21820:
--

[~hyukjin.kwon]: Sound great.. We will wait for your proposal 
https://github.com/apache/spark/pull/18581to be merged. 
Thanks a lot :)

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138405#comment-16138405
 ] 

Kumaresh C R edited comment on SPARK-21820 at 8/23/17 2:13 PM:
---

[~hyukjin.kwon]: Sounds great.. We will wait for your proposal 
https://github.com/apache/spark/pull/18581to be merged. 
Thanks a lot :)


was (Author: crkumaresh24):
[~hyukjin.kwon]: Sound great.. We will wait for your proposal 
https://github.com/apache/spark/pull/18581to be merged. 
Thanks a lot :)

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138324#comment-16138324
 ] 

Kumaresh C R edited comment on SPARK-21820 at 8/23/17 1:21 PM:
---

[~hyukjin.kwon]: Could you please help us here? This issue occurs after we moved 
to "multiLine" as "true"


was (Author: crkumaresh24):
[~hyukjin.kwon]: Could you please help us here ?This issue after we moved to 
"multiLine" as "true"

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138327#comment-16138327
 ] 

Kumaresh C R edited comment on SPARK-21820 at 8/23/17 1:20 PM:
---

[~sowen] This is an issue with spark databricks-CSV reading. I could not find 
any such option in the filter. Could you please help me what could be the 
proper component for this bug ?


was (Author: crkumaresh24):
@Sean Owen: This is an issue with spark databricks-CSV reading. I could not 
find any such option in the filter. Could you please help me what could be the 
proper component for this bug ?

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138327#comment-16138327
 ] 

Kumaresh C R commented on SPARK-21820:
--

@Sean Owen: This is an issue with spark databricks-CSV reading. I could not 
find any such option in the filter. Could you please help me what could be the 
proper component for this bug ?

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, Windows CR LF is not getting parsed properly. If I make 
> multiLine=false, it parses properly. Could you please help here?
> The CSV used in the commands below is attached for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138324#comment-16138324
 ] 

Kumaresh C R commented on SPARK-21820:
--

[~hyukjin.kwon]: Could you please help us here? This issue appeared after we 
moved to "multiLine" set to "true".

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, Windows CR LF is not getting parsed properly. If I make 
> multiLine=false, it parses properly. Could you please help here?
> The CSV used in the commands below is attached for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-21820:
-
Attachment: windows_CRLF.csv

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, Windows CR LF is not getting parsed properly. If I make 
> multiLine=false, it parses properly. Could you please help here?
> The CSV used in the commands below is attached for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)
Kumaresh C R created SPARK-21820:


 Summary: csv option "multiLine" as "true" not parsing windows line 
feed (CR LF) properly
 Key: SPARK-21820
 URL: https://issues.apache.org/jira/browse/SPARK-21820
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Kumaresh C R


With multiLine=true, Windows CR LF is not getting parsed properly. If I make 
multiLine=false, it parses properly. Could you please help here?

The CSV used in the commands below is attached for your reference.

scala> val csvFile = 
spark.read.format("com.databricks.spark.csv").option("header", 
"true").option("inferSchema", "true").option("parserLib", 
"univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
string ... 1 more field]

scala> csvFile.schema.fieldNames
res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)

scala> val csvFile = 
spark.read.format("com.databricks.spark.csv").option("header", 
"true").option("inferSchema", "true").option("parserLib", 
"univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
string ... 1 more field]

scala> csvFile.schema.fieldNames
")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20310) Dependency convergence error for scala-xml

2017-04-12 Thread Samik R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967146#comment-15967146
 ] 

Samik R commented on SPARK-20310:
-

Hi Sean,

Thanks for your comments. You are probably thinking that I am using Scala 
v2.11.0 based on the line ["+-org.scala-lang:scalap:2.11.0"], but this is 
coming from the json4s-jackson dependency of spark-core. I am actually on 
2.11.8 on the box and have that Scala version as a library dependency in the 
pom file as well.

I also agree that this may not actually cause a problem (which is why I filed 
this as a minor bug). I was just planning to pin the dependency to the latest 
version and hope things work fine.

Thanks.
-Samik

> Dependency convergence error for scala-xml
> --
>
> Key: SPARK-20310
> URL: https://issues.apache.org/jira/browse/SPARK-20310
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Samik R
>Priority: Minor
>
> Hi,
> I am trying to compile a package (Apache TinkerPop) that has spark-core as 
> one of its dependencies, building against v2.1.0. When I run the Maven build 
> through a dependency checker, it reports a dependency convergence error 
> within spark-core itself for the scala-xml package, as below:
> Dependency convergence error for org.scala-lang.modules:scala-xml_2.11:1.0.1 
> paths to dependency are:
> +-org.apache.tinkerpop:spark-gremlin:3.2.3
>   +-org.apache.spark:spark-core_2.11:2.1.0
> +-org.json4s:json4s-jackson_2.11:3.2.11
>   +-org.json4s:json4s-core_2.11:3.2.11
> +-org.scala-lang:scalap:2.11.0
>   +-org.scala-lang:scala-compiler:2.11.0
> +-org.scala-lang.modules:scala-xml_2.11:1.0.1
> and
> +-org.apache.tinkerpop:spark-gremlin:3.2.3
>   +-org.apache.spark:spark-core_2.11:2.1.0
> +-org.apache.spark:spark-tags_2.11:2.1.0
>   +-org.scalatest:scalatest_2.11:2.2.6
> +-org.scala-lang.modules:scala-xml_2.11:1.0.2
> Can this be fixed?
> Thanks.
> -Samik



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20310) Dependency convergence error for scala-xml

2017-04-12 Thread Samik R (JIRA)
Samik R created SPARK-20310:
---

 Summary: Dependency convergence error for scala-xml
 Key: SPARK-20310
 URL: https://issues.apache.org/jira/browse/SPARK-20310
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.1.0
Reporter: Samik R
Priority: Minor


Hi,

I am trying to compile a package (Apache TinkerPop) that has spark-core as one 
of its dependencies, building against v2.1.0. When I run the Maven build 
through a dependency checker, it reports a dependency convergence error within 
spark-core itself for the scala-xml package, as below:

Dependency convergence error for org.scala-lang.modules:scala-xml_2.11:1.0.1 
paths to dependency are:
+-org.apache.tinkerpop:spark-gremlin:3.2.3
  +-org.apache.spark:spark-core_2.11:2.1.0
+-org.json4s:json4s-jackson_2.11:3.2.11
  +-org.json4s:json4s-core_2.11:3.2.11
+-org.scala-lang:scalap:2.11.0
  +-org.scala-lang:scala-compiler:2.11.0
+-org.scala-lang.modules:scala-xml_2.11:1.0.1
and
+-org.apache.tinkerpop:spark-gremlin:3.2.3
  +-org.apache.spark:spark-core_2.11:2.1.0
+-org.apache.spark:spark-tags_2.11:2.1.0
  +-org.scalatest:scalatest_2.11:2.2.6
+-org.scala-lang.modules:scala-xml_2.11:1.0.2

Can this be fixed?
Thanks.
-Samik
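
One common way to resolve this kind of convergence error on the consuming side 
is to pin scala-xml to a single version so that the transitive 1.0.1 and 1.0.2 
copies agree. A minimal sketch, shown as an sbt override since the reporter's 
pom is not included here (in Maven the equivalent is a dependencyManagement 
entry for org.scala-lang.modules:scala-xml_2.11):

// build.sbt (sketch): force one scala-xml version for the whole build.
dependencyOverrides += "org.scala-lang.modules" %% "scala-xml" % "1.0.2"

This quiets the dependency checker in the consuming project without changing 
what Spark itself publishes.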




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19503) Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()

2017-02-07 Thread R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

R updated SPARK-19503:
--
Summary: Execution Plan Optimizer: avoid sort or shuffle when it does not 
change end result such as df.sort(...).count()  (was: Dumb Execution Plan)

> Execution Plan Optimizer: avoid sort or shuffle when it does not change end 
> result such as df.sort(...).count()
> ---
>
> Key: SPARK-19503
> URL: https://issues.apache.org/jira/browse/SPARK-19503
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
> Environment: Perhaps only a pyspark or databricks AWS issue
>Reporter: R
>Priority: Minor
>  Labels: execution, optimizer, plan, query
>
> df.sort(...).count()
> performs a shuffle and a sort and then the count. This is wasteful, since the 
> sort is not required here, and it makes me wonder how smart the algebraic 
> optimiser really is. The data may already have a known row count (such as 
> Parquet files), and we should not need to shuffle just to perform a count.
> This may look trivial, but if the optimiser fails to recognise this, I wonder 
> what else it is missing, especially in more complex operations.
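
A quick way to check what the optimiser actually produces for this pattern is 
to compare the physical plans in a spark-shell. A minimal sketch (the DataFrame 
is a stand-in; whether the Sort and extra Exchange survive in the second plan 
depends on the Spark version):

val df = spark.range(0, 1000000).toDF("id")

// Plain global count: partial/final aggregate, no sort expected.
df.groupBy().count().explain()

// Sort followed by count: look for Sort / Exchange nodes that contribute
// nothing to the final result.
df.sort("id").groupBy().count().explain()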



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19504) clearCache fails to delete orphan RDDs, especially in pyspark

2017-02-07 Thread R (JIRA)
R created SPARK-19504:
-

 Summary: clearCache fails to delete orphan RDDs, especially in 
pyspark
 Key: SPARK-19504
 URL: https://issues.apache.org/jira/browse/SPARK-19504
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 2.1.0
 Environment: Both PySpark and Scala Spark, although Scala Spark 
uncaches some RDD types even when orphaned
Reporter: R
Priority: Minor


x = sc.parallelize([1, 3, 10, 9]).cache()
x.count()
# Rebinding x orphans the first cached RDD
x = sc.parallelize([1, 3, 10, 9]).cache()
x.count()
sqlContext.clearCache()

Overwriting x creates an orphan RDD, which cannot be removed with clearCache(). 
This happens in both Scala and PySpark.

A similar thing happens for RDDs created from a DataFrame in Python, e.g. 
spark.read.csv(...).rdd. However, in Scala, clearCache can get rid of some 
orphan RDD types.
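
A minimal sketch of a workaround that avoids creating the orphan in the first 
place: explicitly unpersist before rebinding the name (shown here in Scala; in 
PySpark the same x.unpersist() call applies):

var x = sc.parallelize(Seq(1, 3, 10, 9)).cache()
x.count()

// Release the cached blocks while we still hold the reference; once x is
// rebound, the old cached RDD is only cleaned up later (if at all) by the
// context cleaner after GC.
x.unpersist()

x = sc.parallelize(Seq(1, 3, 10, 9)).cache()
x.count()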



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19503) Dumb Execution Plan

2017-02-07 Thread R (JIRA)
R created SPARK-19503:
-

 Summary: Dumb Execution Plan
 Key: SPARK-19503
 URL: https://issues.apache.org/jira/browse/SPARK-19503
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 2.1.0
 Environment: Perhaps only a pyspark or databricks AWS issue
Reporter: R
Priority: Minor


df.sort(...).count()
performs a shuffle and a sort and then the count. This is wasteful, since the 
sort is not required here, and it makes me wonder how smart the algebraic 
optimiser really is. The data may already have a known row count (such as 
Parquet files), and we should not need to shuffle just to perform a count.

This may look trivial, but if the optimiser fails to recognise this, I wonder 
what else it is missing, especially in more complex operations.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17576) As LongType.simpleString in spark is "bigint", Carbon will convert Long to BigInt

2016-09-17 Thread Naresh P R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naresh P R closed SPARK-17576.
--
Resolution: Invalid

> As LongType.simpleString in spark is "bigint", Carbon will convert Long to 
> BigInt
> -
>
> Key: SPARK-17576
> URL: https://issues.apache.org/jira/browse/SPARK-17576
> Project: Spark
>  Issue Type: Improvement
>Reporter: Naresh P R
>Priority: Minor
>
> The DESCRIBE command shows DataType.simpleString in the data type column.
> For the LongType Spark DataType, simpleString is "bigint".
> We are internally using LONG as the name for bigint, which needs to be changed 
> to BIGINT in Carbon DataTypes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17576) As LongType.simpleString in spark is "bigint", Carbon will convert Long to BigInt

2016-09-17 Thread Naresh P R (JIRA)
Naresh P R created SPARK-17576:
--

 Summary: As LongType.simpleString in spark is "bigint", Carbon 
will convert Long to BigInt
 Key: SPARK-17576
 URL: https://issues.apache.org/jira/browse/SPARK-17576
 Project: Spark
  Issue Type: Improvement
Reporter: Naresh P R
Priority: Minor


The DESCRIBE command shows DataType.simpleString in the data type column.

For the LongType Spark DataType, simpleString is "bigint".

We are internally using LONG as the name for bigint, which needs to be changed 
to BIGINT in Carbon DataTypes.
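
For reference, the value can be checked directly in a spark-shell:

import org.apache.spark.sql.types.LongType

// DESCRIBE reports DataType.simpleString, which for LongType is "bigint".
LongType.simpleString
// res0: String = bigint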



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14194) spark csv reader not working properly if CSV content contains CRLF character (newline) in the intermediate cell

2016-03-28 Thread Kumaresh C R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-14194:
-
Description: 
We have CSV content like below,

Sl.NO, Employee_Name, Company, Address, Country, ZIP_Code\n\r
"1", "ABCD", "XYZ", "1234", "XZ Street \n\r(CRLF charater), 
Municapality,","USA", "1234567"

Since there is a '\n\r' character in the middle of the row (specifically in the 
Address column), when we execute the Spark code below, it creates the dataframe 
with two rows (excluding the header row), which is wrong. Since we have 
specified the quote (") character, why is the character in the middle treated 
as a newline? This creates an issue while processing the created dataframe.

DataFrame df = sqlContextManager.getSqlContext().read()
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .option("delimiter", delim)
        .option("quote", quote)
        .option("escape", escape)
        .load(sourceFile);
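
A minimal spark-shell sketch of the same read in Scala, with the option values 
filled in as assumptions since delim, quote and escape are not given in the 
report (comma delimiter from the sample content, double quote, backslash 
escape; the path is hypothetical):

// A quoted Address field containing an embedded CRLF should stay in one
// logical row; the report observes it being split into two rows instead.
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .option("quote", "\"")
  .option("escape", "\\")
  .load("/path/to/sample_with_crlf.csv")

println(df.count())   // 1 data row expected; 2 observed per the report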

   


> spark csv reader not working properly if CSV content contains CRLF character 
> (newline) in the intermediate cell
> ---
>
> Key: SPARK-14194
> URL: https://issues.apache.org/jira/browse/SPARK-14194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Kumaresh C R
>
> We have CSV content like below,
> Sl.NO, Employee_Name, Company, Address, Country, ZIP_Code\n\r
> "1", "ABCD", "XYZ", "1234", "XZ Street \n\r(CRLF charater), 
> Municapality,","USA", "1234567"
> Since there is a '\n\r' character in the middle of the row (specifically in 
> the Address column), when we execute the Spark code below, it creates the 
> dataframe with two rows (excluding the header row), which is wrong. Since we 
> have specified the quote (") character, why is the character in the middle 
> treated as a newline? This creates an issue while processing the created 
> dataframe.
> DataFrame df = sqlContextManager.getSqlContext().read()
>         .format("com.databricks.spark.csv")
>         .option("header", "true")
>         .option("inferSchema", "true")
>         .option("delimiter", delim)
>         .option("quote", quote)
>         .option("escape", escape)
>         .load(sourceFile);
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14194) spark csv reader not working properly if CSV content contains CRLF character (newline) in the intermediate cell

2016-03-28 Thread Kumaresh C R (JIRA)
Kumaresh C R created SPARK-14194:


 Summary: spark csv reader not working properly if CSV content 
contains CRLF character (newline) in the intermediate cell
 Key: SPARK-14194
 URL: https://issues.apache.org/jira/browse/SPARK-14194
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
Reporter: Kumaresh C R






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org