Re: Is Structured streaming ready for production usage

2017-06-08 Thread Shixiong(Ryan) Zhu
Please take a look at
http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
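
For reference, a minimal sketch of the Kafka source described in that guide (assumptions, not from this thread: the spark-sql-kafka-0-10 package is on the classpath, brokers run at host1:9092, and the topic is named "events"). In Structured Streaming this source takes the place of the DStream-era "direct" Kafka stream:

// Read from Kafka as an unbounded streaming DataFrame.
val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "events")
  .load()

// Kafka records arrive as binary key/value columns; cast them to strings to inspect them.
val query = kafkaDF
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()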

On Thu, Jun 8, 2017 at 4:46 PM, swetha kasireddy 
wrote:

> OK. Can we use Spark Kafka Direct with  Structured Streaming?
>
> On Thu, Jun 8, 2017 at 4:46 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> OK. Can we use Spark Kafka Direct as part of Structured Streaming?
>>
>> On Thu, Jun 8, 2017 at 3:35 PM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> YES. At Databricks, our customers have already been using Structured
>>> Streaming and in the last month alone processed over 3 trillion records.
>>>
>>> https://databricks.com/blog/2017/06/06/simple-super-fast-streaming-engine-apache-spark.html
>>>
>>> On Thu, Jun 8, 2017 at 3:03 PM, SRK  wrote:
>>>
 Hi,

 Is structured streaming ready for production usage in Spark 2.2?

 Thanks,
 Swetha



 --
 View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-Structured-streaming-ready-for-production-usage-tp28751.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org


>>>
>>
>


Re: Is Structured streaming ready for production usage

2017-06-08 Thread swetha kasireddy
OK. Can we use Spark Kafka Direct with  Structured Streaming?

On Thu, Jun 8, 2017 at 4:46 PM, swetha kasireddy 
wrote:

> OK. Can we use Spark Kafka Direct as part of Structured Streaming?
>
> On Thu, Jun 8, 2017 at 3:35 PM, Tathagata Das  > wrote:
>
>> YES. At Databricks, our customers have already been using Structured
>> Streaming and in the last month alone processed over 3 trillion records.
>>
>> https://databricks.com/blog/2017/06/06/simple-super-fast-streaming-engine-apache-spark.html
>>
>> On Thu, Jun 8, 2017 at 3:03 PM, SRK  wrote:
>>
>>> Hi,
>>>
>>> Is structured streaming ready for production usage in Spark 2.2?
>>>
>>> Thanks,
>>> Swetha
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-Structured-streaming-ready-for-production-usage-tp28751.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>


Re: Is Structured streaming ready for production usage

2017-06-08 Thread swetha kasireddy
OK. Can we use Spark Kafka Direct as part of Structured Streaming?

On Thu, Jun 8, 2017 at 3:35 PM, Tathagata Das 
wrote:

> YES. At Databricks, our customers have already been using Structured
> Streaming and in the last month alone processed over 3 trillion records.
>
> https://databricks.com/blog/2017/06/06/simple-super-fast-streaming-engine-apache-spark.html
>
> On Thu, Jun 8, 2017 at 3:03 PM, SRK  wrote:
>
>> Hi,
>>
>> Is structured streaming ready for production usage in Spark 2.2?
>>
>> Thanks,
>> Swetha
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-Structured-streaming-ready-for-production-usage-tp28751.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Is Structured streaming ready for production usage

2017-06-08 Thread Tathagata Das
YES. At Databricks, our customers have already been using Structured
Streaming and in the last month alone processed over 3 trillion records.

https://databricks.com/blog/2017/06/06/simple-super-fast-streaming-engine-apache-spark.html

On Thu, Jun 8, 2017 at 3:03 PM, SRK  wrote:

> Hi,
>
> Is structured streaming ready for production usage in Spark 2.2?
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-Structured-streaming-ready-for-production-usage-tp28751.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Is Structured streaming ready for production usage

2017-06-08 Thread SRK
Hi,

Is structured streaming ready for production usage in Spark 2.2?

Thanks,
Swetha



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-Structured-streaming-ready-for-production-usage-tp28751.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: problem initiating spark context with pyspark

2017-06-08 Thread Marco Mistroni
try this link

http://letstalkspark.blogspot.co.uk/2016/02/getting-started-with-spark-on-window-64.html

it helped me when I had similar problems with Windows...

hth

On Wed, Jun 7, 2017 at 3:46 PM, Curtis Burkhalter <
curtisburkhal...@gmail.com> wrote:

> Thanks Doc I saw this on another board yesterday so I've tried this by
> first going to the directory where I've stored the wintutils.exe and then
> as an admin running the command  that you suggested and I get this
> exception when checking the permissions:
>
> C:\winutils\bin>winutils.exe ls -F C:\tmp\hive
> FindFileOwnerAndPermission error (1789): The trust relationship between
> this workstation and the primary domain failed.
>
> I'm fairly new to the command line and determining what the different
> exceptions mean. Do you have any advice what this error means and how I
> might go about fixing this?
>
> Thanks again
>
>
> On Wed, Jun 7, 2017 at 9:51 AM, Doc Dwarf  wrote:
>
>> Hi Curtis,
>>
>> I believe in windows, the following command needs to be executed: (will
>> need winutils installed)
>>
>> D:\winutils\bin\winutils.exe chmod 777 D:\tmp\hive
>>
>>
>>
>> On 6 June 2017 at 09:45, Curtis Burkhalter 
>> wrote:
>>
>>> Hello all,
>>>
>>> I'm new to Spark and I'm trying to interact with it using Pyspark. I'm
>>> using the prebuilt version of spark v. 2.1.1 and when I go to the command
>>> line and use the command 'bin\pyspark' I have initialization problems and
>>> get the following message:
>>>
>>> C:\spark\spark-2.1.1-bin-hadoop2.7> bin\pyspark
>>> Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 11:57:41)
>>> [MSC v.1900 64 bit (AMD64)] on win32
>>> Type "help", "copyright", "credits" or "license" for more information.
>>> Using Spark's default log4j profile: org/apache/spark/log4j-default
>>> s.properties
>>> Setting default log level to "WARN".
>>> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
>>> setLogLevel(newLevel).
>>> 17/06/06 10:30:14 WARN NativeCodeLoader: Unable to load native-hadoop
>>> library for your platform... using builtin-java classes where applicable
>>> 17/06/06 10:30:21 WARN ObjectStore: Version information not found in
>>> metastore. hive.metastore.schema.verification is not enabled so
>>> recording the schema version 1.2.0
>>> 17/06/06 10:30:21 WARN ObjectStore: Failed to get database default,
>>> returning NoSuchObjectException
>>> Traceback (most recent call last):
>>>   File "C:\spark\spark-2.1.1-bin-hadoop2.7\python\pyspark\sql\utils.py",
>>> line 63, in deco
>>> return f(*a, **kw)
>>>   File 
>>> "C:\spark\spark-2.1.1-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py",
>>> line 319, in get_return_value
>>> py4j.protocol.Py4JJavaError: An error occurred while calling
>>> o22.sessionState.
>>> : java.lang.IllegalArgumentException: Error while instantiating
>>> 'org.apache.spark.sql.hive.HiveSessionState':
>>> at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$Spar
>>> kSession$$reflect(SparkSession.scala:981)
>>> at org.apache.spark.sql.SparkSession.sessionState$lzycompute(Sp
>>> arkSession.scala:110)
>>> at org.apache.spark.sql.SparkSession.sessionState(SparkSession.
>>> scala:109)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
>>> ssorImpl.java:62)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
>>> thodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.jav
>>> a:357)
>>> at py4j.Gateway.invoke(Gateway.java:280)
>>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.j
>>> ava:132)
>>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>> at py4j.GatewayConnection.run(GatewayConnection.java:214)
>>> at java.lang.Thread.run(Thread.java:748)
>>> Caused by: java.lang.reflect.InvocationTargetException
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>> Method)
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance(Native
>>> ConstructorAccessorImpl.java:62)
>>> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(De
>>> legatingConstructorAccessorImpl.java:45)
>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:4
>>> 23)
>>> at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$Spar
>>> kSession$$reflect(SparkSession.scala:978)
>>> ... 13 more
>>> Caused by: java.lang.IllegalArgumentException: Error while
>>> instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':
>>> at org.apache.spark.sql.internal.SharedState$.org$apache$spark$
>>> sql$internal$SharedState$$reflect(SharedState.scala:169)
>>> at org.apache.spark.sql.internal.SharedState.(Sha

unsubscribe

2017-06-08 Thread Brindha Sengottaiyan



Re: [Spark Core] Does spark support read from remote Hive server via JDBC

2017-06-08 Thread Ranadip Chatterjee
Looks like your session user does not have the required privileges on the
remote hdfs directory that is holding the hive data. Since you get the
columns, your session is able to read the metadata, so connection to the
remote hiveserver2 is successful. You should be able to find more
troubleshooting information in the remote hiveserver2 log file.

Try with select * limit 10 to keep it simple.

Ranadip


On 8 Jun 2017 6:31 pm, "Даша Ковальчук"  wrote:

The  result is count = 0.

2017-06-08 19:42 GMT+03:00 ayan guha :

> What is the result of test.count()?
>
> On Fri, 9 Jun 2017 at 1:41 am, Даша Ковальчук 
> wrote:
>
>> Thanks for your reply!
>> Yes, I tried this solution and had the same result. Maybe you have
>> another solution or maybe I can execute query in another way on remote
>> cluster?
>>
>> 2017-06-08 18:30 GMT+03:00 Даша Ковальчук :
>>
>>> Thanks for your reply!
>>> Yes, I tried this solution and had the same result. Maybe you have
>>> another solution or maybe I can execute query in another way on remote
>>> cluster?
>>>
>>
>>> 2017-06-08 18:10 GMT+03:00 Vadim Semenov :
>>>
 Have you tried running a query? something like:

 ```
 test.select("*").limit(10).show()
 ```

 On Thu, Jun 8, 2017 at 4:16 AM, Даша Ковальчук <
 dashakovalchu...@gmail.com> wrote:

> Hi guys,
>
> I need to execute hive queries on remote hive server from spark, but
> for some reasons i receive only column names(without data).
> Data available in table, I checked it via HUE and java jdbc
>  connection.
>
> Here is my code example:
> val test = spark.read
> .option("url", "jdbc:hive2://remote.hive.server:
> 1/work_base")
> .option("user", "user")
> .option("password", "password")
> .option("dbtable", "some_table_with_data")
> .option("driver", "org.apache.hive.jdbc.HiveDriver")
> .format("jdbc")
> .load()
> test.show()
>
>
> Scala version: 2.11
> Spark version: 2.1.0, i also tried 2.1.1
> Hive version: CDH 5.7 Hive 1.1.1
> Hive JDBC version: 1.1.1
>
> But this problem available on Hive with later versions, too.
> I didn't find anything in mail group answers and StackOverflow.
> Could you, please, help me with this issue or could you help me find 
> correct
> solution how to query remote hive from spark?
>
> Thanks in advance!
>


>>> --
> Best Regards,
> Ayan Guha
>


Re: [Spark Core] Does spark support read from remote Hive server via JDBC

2017-06-08 Thread Richard Moorhead
You might try pointing your spark context at the hive metastore via:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()

// hive.metastore.uris expects a thrift:// URI
conf.set("hive.metastore.uris", "thrift://your.thrift.server:9083")

val sparkSession = SparkSession.builder()
  .config(conf)
  .enableHiveSupport()
  .getOrCreate()
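
Once the session is pointed at the metastore this way, the table from the earlier snippet can be read without going through the Hive JDBC driver at all; a minimal sketch (database and table names are taken from the code earlier in this thread):

val df = sparkSession.sql("SELECT * FROM work_base.some_table_with_data LIMIT 10")
df.show()

// equivalently
sparkSession.table("work_base.some_table_with_data").show(10)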




. . . . . . . . . . . . . . . . . . . . . . . . . . .

Richard Moorhead
Software Engineer
richard.moorh...@c2fo.com

C2FO: The World's Market for Working Capital®





From: Даша Ковальчук 
Sent: Thursday, June 8, 2017 12:30 PM
To: ayan guha
Cc: user@spark.apache.org
Subject: Re: [Spark Core] Does spark support read from remote Hive server via 
JDBC

The  result is count = 0.

2017-06-08 19:42 GMT+03:00 ayan guha <guha.a...@gmail.com>:
What is the result of test.count()?

On Fri, 9 Jun 2017 at 1:41 am, Даша Ковальчук <dashakovalchu...@gmail.com> wrote:
Thanks for your reply!
Yes, I tried this solution and had the same result. Maybe you have another 
solution or maybe I can execute query in another way on remote cluster?

2017-06-08 18:30 GMT+03:00 Даша Ковальчук <dashakovalchu...@gmail.com>:
Thanks for your reply!
Yes, I tried this solution and had the same result. Maybe you have another 
solution or maybe I can execute query in another way on remote cluster?

2017-06-08 18:10 GMT+03:00 Vadim Semenov <vadim.seme...@datadoghq.com>:
Have you tried running a query? something like:

```
test.select("*").limit(10).show()
```

On Thu, Jun 8, 2017 at 4:16 AM, Даша Ковальчук <dashakovalchu...@gmail.com> wrote:
Hi guys,

I need to execute hive queries on remote hive server from spark, but for some 
reasons i receive only column names(without data).
Data available in table, I checked it via HUE and java jdbc connection.

Here is my code example:
val test = spark.read
.option("url", "jdbc:hive2://remote.hive.server:1/work_base")
.option("user", "user")
.option("password", "password")
.option("dbtable", "some_table_with_data")
.option("driver", "org.apache.hive.jdbc.HiveDriver")
.format("jdbc")
.load()
test.show()


Scala version: 2.11
Spark version: 2.1.0, i also tried 2.1.1
Hive version: CDH 5.7 Hive 1.1.1
Hive JDBC version: 1.1.1

But this problem available on Hive with later versions, too.
I didn't find anything in mail group answers and StackOverflow.
Could you, please, help me with this issue or could you help me find correct 
solution how to query remote hive from spark?

Thanks in advance!


--
Best Regards,
Ayan Guha



Re: [Spark Core] Does spark support read from remote Hive server via JDBC

2017-06-08 Thread Даша Ковальчук
The  result is count = 0.

2017-06-08 19:42 GMT+03:00 ayan guha :

> What is the result of test.count()?
>
> On Fri, 9 Jun 2017 at 1:41 am, Даша Ковальчук 
> wrote:
>
>> Thanks for your reply!
>> Yes, I tried this solution and had the same result. Maybe you have
>> another solution or maybe I can execute query in another way on remote
>> cluster?
>>
>> 2017-06-08 18:30 GMT+03:00 Даша Ковальчук :
>>
>>> Thanks for your reply!
>>> Yes, I tried this solution and had the same result. Maybe you have
>>> another solution or maybe I can execute query in another way on remote
>>> cluster?
>>>
>>
>>> 2017-06-08 18:10 GMT+03:00 Vadim Semenov :
>>>
 Have you tried running a query? something like:

 ```
 test.select("*").limit(10).show()
 ```

 On Thu, Jun 8, 2017 at 4:16 AM, Даша Ковальчук <
 dashakovalchu...@gmail.com> wrote:

> Hi guys,
>
> I need to execute hive queries on remote hive server from spark, but
> for some reasons i receive only column names(without data).
> Data available in table, I checked it via HUE and java jdbc
>  connection.
>
> Here is my code example:
> val test = spark.read
> .option("url", "jdbc:hive2://remote.hive.server:
> 1/work_base")
> .option("user", "user")
> .option("password", "password")
> .option("dbtable", "some_table_with_data")
> .option("driver", "org.apache.hive.jdbc.HiveDriver")
> .format("jdbc")
> .load()
> test.show()
>
>
> Scala version: 2.11
> Spark version: 2.1.0, i also tried 2.1.1
> Hive version: CDH 5.7 Hive 1.1.1
> Hive JDBC version: 1.1.1
>
> But this problem available on Hive with later versions, too.
> I didn't find anything in mail group answers and StackOverflow.
> Could you, please, help me with this issue or could you help me find 
> correct
> solution how to query remote hive from spark?
>
> Thanks in advance!
>


>>> --
> Best Regards,
> Ayan Guha
>


Re: Read Data From NFS

2017-06-08 Thread ayan guha
Any one?

On Thu, 8 Jun 2017 at 3:26 pm, ayan guha  wrote:

> Hi Guys
>
> Quick one: how does Spark deal with (i.e., create partitions for) large files sitting
> on NFS, assuming all executors can see the file in exactly the same way?
>
> ie, when I run
>
> r = sc.textFile("file://my/file")
>
> what happens if the file is on NFS?
>
> is there any difference from
>
> r = sc.textFile("hdfs://my/file")
>
> Are the same input formats used in both cases?
>
>
> --
> Best Regards,
> Ayan Guha
>
-- 
Best Regards,
Ayan Guha
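
Not an authoritative answer, but for reference: sc.textFile goes through Hadoop's TextInputFormat for both file:// and hdfs:// paths, so the main difference is how splits are computed (HDFS block boundaries versus a size-based split of the local/NFS file). A small sketch of the partition hint that applies in either case (the path and the value 16 are illustrative):

// The optional second argument gives a lower bound on the number of partitions.
val r = sc.textFile("file:///mnt/nfs/my/file", minPartitions = 16)
println(r.getNumPartitions)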


Re: [Spark Core] Does spark support read from remote Hive server via JDBC

2017-06-08 Thread ayan guha
What is the result of test.count()?
On Fri, 9 Jun 2017 at 1:41 am, Даша Ковальчук 
wrote:

> Thanks for your reply!
> Yes, I tried this solution and had the same result. Maybe you have another
> solution or maybe I can execute query in another way on remote cluster?
>
> 2017-06-08 18:30 GMT+03:00 Даша Ковальчук :
>
>> Thanks for your reply!
>> Yes, I tried this solution and had the same result. Maybe you have
>> another solution or maybe I can execute query in another way on remote
>> cluster?
>>
>
>> 2017-06-08 18:10 GMT+03:00 Vadim Semenov :
>>
>>> Have you tried running a query? something like:
>>>
>>> ```
>>> test.select("*").limit(10).show()
>>> ```
>>>
>>> On Thu, Jun 8, 2017 at 4:16 AM, Даша Ковальчук <
>>> dashakovalchu...@gmail.com> wrote:
>>>
 Hi guys,

 I need to execute hive queries on remote hive server from spark, but
 for some reasons i receive only column names(without data).
 Data available in table, I checked it via HUE and java jdbc connection.

 Here is my code example:
 val test = spark.read
 .option("url", "jdbc:hive2://remote.hive.server:
 1/work_base")
 .option("user", "user")
 .option("password", "password")
 .option("dbtable", "some_table_with_data")
 .option("driver", "org.apache.hive.jdbc.HiveDriver")
 .format("jdbc")
 .load()
 test.show()


 Scala version: 2.11
 Spark version: 2.1.0, i also tried 2.1.1
 Hive version: CDH 5.7 Hive 1.1.1
 Hive JDBC version: 1.1.1

 But this problem available on Hive with later versions, too.
 I didn't find anything in mail group answers and StackOverflow.
 Could you, please, help me with this issue or could you help me find 
 correct
 solution how to query remote hive from spark?

 Thanks in advance!

>>>
>>>
>> --
Best Regards,
Ayan Guha


Re: [CSV] If number of columns of one row bigger than maxcolumns it stop the whole parsing process.

2017-06-08 Thread Takeshi Yamamuro
I filed a jira about this issue:
https://issues.apache.org/jira/browse/SPARK-21024
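
For anyone landing here, a minimal sketch of the options discussed in this thread (the input path and the 40000 limit are illustrative). Note the point made below: DROPMALFORMED cannot drop a row whose column count exceeds maxColumns, because the underlying parser library fails before Spark sees the row, so raising maxColumns is currently the only way to keep parsing such files:

val df = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")   // drops malformed rows that Spark itself can detect
  .option("maxColumns", "40000")     // raises the univocity parser limit
  .csv("/data/input/*.csv")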

On Thu, Jun 8, 2017 at 1:27 AM, Chanh Le  wrote:

> Can you recommend one?
>
> Thanks.
>
> On Thu, Jun 8, 2017 at 2:47 PM Jörn Franke  wrote:
>
>> You can change the CSV parser library
>>
>> On 8. Jun 2017, at 08:35, Chanh Le  wrote:
>>
>>
>> I did add mode -> DROPMALFORMED but it still couldn't ignore it because
>> the error raise from the CSV library that Spark are using.
>>
>>
>> On Thu, Jun 8, 2017 at 12:11 PM Jörn Franke  wrote:
>>
>>> The CSV data source allows you to skip invalid lines - this should also
>>> include lines that have more than maxColumns. Choose mode "DROPMALFORMED"
>>>
>>> On 8. Jun 2017, at 03:04, Chanh Le  wrote:
>>>
>>> Hi Takeshi, Jörn Franke,
>>>
>>> The problem is even I increase the maxColumns it still have some lines
>>> have larger columns than the one I set and it will cost a lot of memory.
>>> So I just wanna skip the line has larger columns than the maxColumns I
>>> set.
>>>
>>> Regards,
>>> Chanh
>>>
>>>
>>> On Thu, Jun 8, 2017 at 12:48 AM Takeshi Yamamuro 
>>> wrote:
>>>
 Is it not enough to set `maxColumns` in CSV options?

 https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116

 // maropu

 On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke 
 wrote:

> Spark CSV data source should be able
>
> On 7. Jun 2017, at 17:50, Chanh Le  wrote:
>
> Hi everyone,
> I am using Spark 2.1.1 to read csv files and convert to avro files.
> One problem that I am facing is if one row of csv file has more
> columns than maxColumns (default is 20480). The process of parsing
> was stop.
>
> Internal state when error was thrown: line=1, column=3, record=0,
> charIndex=12
> com.univocity.parsers.common.TextParsingException: 
> java.lang.ArrayIndexOutOfBoundsException
> - 2
> Hint: Number of columns processed may have exceeded limit of 2
> columns. Use settings.setMaxColumns(int) to define the maximum number of
> columns your input can have
> Ensure your configuration is correct, with delimiters, quotes and
> escape sequences that match the input format you are trying to parse
> Parser Configuration: CsvParserSettings:
>
>
> I did some investigation in univocity
>  library but the way
> it handle is throw error that why spark stop the process.
>
> How to skip the invalid row and just continue to parse next valid one?
> Any libs can replace univocity in that job?
>
> Thanks & regards,
> Chanh
> --
> Regards,
> Chanh
>
>


 --
 ---
 Takeshi Yamamuro

>>> --
>>> Regards,
>>> Chanh
>>>
>>> --
>> Regards,
>> Chanh
>>
>> --
> Regards,
> Chanh
>



-- 
---
Takeshi Yamamuro


Re: Question about mllib.recommendation.ALS

2017-06-08 Thread Sahib Aulakh [Search] ­
Many thanks. Will try it.
On Thu, Jun 8, 2017 at 8:41 AM Nick Pentreath 
wrote:

> Spark 2.2 will support the recommend-all methods in ML.
>
> Also, both ML and MLLIB performance has been greatly improved for the
> recommend-all methods.
>
> Perhaps you could check out the current RC of Spark 2.2 or master branch
> to try it out?
>
> N
>
> On Thu, 8 Jun 2017 at 17:18, Sahib Aulakh [Search] ­ <
> sahibaul...@coupang.com> wrote:
>
>> Many thanks for your response. I already figured out the details with
>> some help from another forum.
>>
>>
>>1. I was trying to predict ratings for all users and all products.
>>This is inefficient and now I am trying to reduce the number of required
>>predictions.
>>2. There is a nice example buried in Spark source code which points
>>out the usage of ML side ALS.
>>
>> Regards.
>> Sahib Aulakh.
>>
>> On Wed, Jun 7, 2017 at 8:17 PM, Ryan  wrote:
>>
>>> 1. could you give job, stage & task status from Spark UI? I found it
>>> extremely useful for performance tuning.
>>>
>>> 2. use modele.transform for predictions. Usually we have a pipeline for
>>> preparing training data, and use the same pipeline to transform data you
>>> want to predict could give us the prediction column.
>>>
>>> On Thu, Jun 1, 2017 at 7:48 AM, Sahib Aulakh [Search] ­ <
>>> sahibaul...@coupang.com> wrote:
>>>
 Hello:

 I am training the ALS model for recommendations. I have about 200m
 ratings from about 10m users and 3m products. I have a small cluster with
 48 cores and 120gb cluster-wide memory.

 My code is very similar to the example code

 spark/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala
 code.

 I have a couple of questions:


1. All steps up to model training runs reasonably fast. Model
training is under 10 minutes for rank 20. However, the
model.recommendProductsForUsers step is either slow or just does not 
 work
as the code just seems to hang at this point. I have tried user and 
 product
blocks sizes of -1 and 20, 40, etc, played with executor memory size, 
 etc.
Can someone shed some light here as to what could be wrong?
2. Also, is there any example code for the ml.recommendation.ALS
algorithm? I can figure out how to train the model but I don't 
 understand
(from the documentation) how to perform predictions?

 Thanks for any information you can provide.
 Sahib Aulakh.


 --
 Sahib Aulakh
 Sr. Principal Engineer

>>>
>>>
>>
>>
>> --
>> Sahib Aulakh
>> Sr. Principal Engineer
>>
> --
Sahib Aulakh
Sr. Principal Engineer


Re: [Spark Core] Does spark support read from remote Hive server via JDBC

2017-06-08 Thread Даша Ковальчук
Thanks for your reply!
Yes, I tried this solution and had the same result. Maybe you have another
solution or maybe I can execute query in another way on remote cluster?

2017-06-08 18:30 GMT+03:00 Даша Ковальчук :

> Thanks for your reply!
> Yes, I tried this solution and had the same result. Maybe you have another
> solution or maybe I can execute query in another way on remote cluster?
>
> 2017-06-08 18:10 GMT+03:00 Vadim Semenov :
>
>> Have you tried running a query? something like:
>>
>> ```
>> test.select("*").limit(10).show()
>> ```
>>
>> On Thu, Jun 8, 2017 at 4:16 AM, Даша Ковальчук <
>> dashakovalchu...@gmail.com> wrote:
>>
>>> Hi guys,
>>>
>>> I need to execute hive queries on remote hive server from spark, but
>>> for some reasons i receive only column names(without data).
>>> Data available in table, I checked it via HUE and java jdbc connection.
>>>
>>> Here is my code example:
>>> val test = spark.read
>>> .option("url", "jdbc:hive2://remote.hive.server:
>>> 1/work_base")
>>> .option("user", "user")
>>> .option("password", "password")
>>> .option("dbtable", "some_table_with_data")
>>> .option("driver", "org.apache.hive.jdbc.HiveDriver")
>>> .format("jdbc")
>>> .load()
>>> test.show()
>>>
>>>
>>> Scala version: 2.11
>>> Spark version: 2.1.0, i also tried 2.1.1
>>> Hive version: CDH 5.7 Hive 1.1.1
>>> Hive JDBC version: 1.1.1
>>>
>>> But this problem available on Hive with later versions, too.
>>> I didn't find anything in mail group answers and StackOverflow.
>>> Could you, please, help me with this issue or could you help me find correct
>>> solution how to query remote hive from spark?
>>>
>>> Thanks in advance!
>>>
>>
>>
>


Re: Question about mllib.recommendation.ALS

2017-06-08 Thread Nick Pentreath
Spark 2.2 will support the recommend-all methods in ML.

Also, both ML and MLLIB performance has been greatly improved for the
recommend-all methods.

Perhaps you could check out the current RC of Spark 2.2 or master branch to
try it out?

N
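
For reference, a minimal sketch of the ML-side ALS API discussed here (the toy data, column names, and hyperparameters are illustrative assumptions, not from this thread):

import org.apache.spark.ml.recommendation.ALS

// Illustrative toy data; in the thread's setting this would be the 200M-rating DataFrame.
val training = spark.createDataFrame(Seq(
  (0, 10, 4.0f), (0, 11, 1.0f), (1, 10, 5.0f), (1, 12, 2.0f)
)).toDF("userId", "productId", "rating")

val als = new ALS()
  .setRank(20)
  .setMaxIter(10)
  .setRegParam(0.1)
  .setUserCol("userId")
  .setItemCol("productId")
  .setRatingCol("rating")

val model = als.fit(training)

// transform() scores only the (user, product) pairs you supply, which avoids the
// all-users-times-all-products blow-up mentioned above.
val predictions = model.transform(training)

// On Spark 2.2+ the recommend-all helpers are available directly on the model:
// val top10PerUser = model.recommendForAllUsers(10)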

On Thu, 8 Jun 2017 at 17:18, Sahib Aulakh [Search] ­ <
sahibaul...@coupang.com> wrote:

> Many thanks for your response. I already figured out the details with some
> help from another forum.
>
>
>1. I was trying to predict ratings for all users and all products.
>This is inefficient and now I am trying to reduce the number of required
>predictions.
>2. There is a nice example buried in Spark source code which points
>out the usage of ML side ALS.
>
> Regards.
> Sahib Aulakh.
>
> On Wed, Jun 7, 2017 at 8:17 PM, Ryan  wrote:
>
>> 1. could you give job, stage & task status from Spark UI? I found it
>> extremely useful for performance tuning.
>>
>> 2. use modele.transform for predictions. Usually we have a pipeline for
>> preparing training data, and use the same pipeline to transform data you
>> want to predict could give us the prediction column.
>>
>> On Thu, Jun 1, 2017 at 7:48 AM, Sahib Aulakh [Search] ­ <
>> sahibaul...@coupang.com> wrote:
>>
>>> Hello:
>>>
>>> I am training the ALS model for recommendations. I have about 200m
>>> ratings from about 10m users and 3m products. I have a small cluster with
>>> 48 cores and 120gb cluster-wide memory.
>>>
>>> My code is very similar to the example code
>>>
>>> spark/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala
>>> code.
>>>
>>> I have a couple of questions:
>>>
>>>
>>>1. All steps up to model training runs reasonably fast. Model
>>>training is under 10 minutes for rank 20. However, the
>>>model.recommendProductsForUsers step is either slow or just does not work
>>>as the code just seems to hang at this point. I have tried user and 
>>> product
>>>blocks sizes of -1 and 20, 40, etc, played with executor memory size, 
>>> etc.
>>>Can someone shed some light here as to what could be wrong?
>>>2. Also, is there any example code for the ml.recommendation.ALS
>>>algorithm? I can figure out how to train the model but I don't understand
>>>(from the documentation) how to perform predictions?
>>>
>>> Thanks for any information you can provide.
>>> Sahib Aulakh.
>>>
>>>
>>> --
>>> Sahib Aulakh
>>> Sr. Principal Engineer
>>>
>>
>>
>
>
> --
> Sahib Aulakh
> Sr. Principal Engineer
>


Re: Question about mllib.recommendation.ALS

2017-06-08 Thread Sahib Aulakh [Search] ­
Many thanks for your response. I already figured out the details with some
help from another forum.


   1. I was trying to predict ratings for all users and all products. This
   is inefficient and now I am trying to reduce the number of required
   predictions.
   2. There is a nice example buried in Spark source code which points out
   the usage of ML side ALS.

Regards.
Sahib Aulakh.

On Wed, Jun 7, 2017 at 8:17 PM, Ryan  wrote:

> 1. could you give job, stage & task status from Spark UI? I found it
> extremely useful for performance tuning.
>
> 2. use modele.transform for predictions. Usually we have a pipeline for
> preparing training data, and use the same pipeline to transform data you
> want to predict could give us the prediction column.
>
> On Thu, Jun 1, 2017 at 7:48 AM, Sahib Aulakh [Search] ­ <
> sahibaul...@coupang.com> wrote:
>
>> Hello:
>>
>> I am training the ALS model for recommendations. I have about 200m
>> ratings from about 10m users and 3m products. I have a small cluster with
>> 48 cores and 120gb cluster-wide memory.
>>
>> My code is very similar to the example code
>>
>> spark/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala
>> code.
>>
>> I have a couple of questions:
>>
>>
>>1. All steps up to model training runs reasonably fast. Model
>>training is under 10 minutes for rank 20. However, the
>>model.recommendProductsForUsers step is either slow or just does not
>>work as the code just seems to hang at this point. I have tried user and
>>product blocks sizes of -1 and 20, 40, etc, played with executor memory
>>size, etc. Can someone shed some light here as to what could be wrong?
>>2. Also, is there any example code for the ml.recommendation.ALS
>>algorithm? I can figure out how to train the model but I don't understand
>>(from the documentation) how to perform predictions?
>>
>> Thanks for any information you can provide.
>> Sahib Aulakh.
>>
>>
>> --
>> Sahib Aulakh
>> Sr. Principal Engineer
>>
>
>


-- 
Sahib Aulakh
Sr. Principal Engineer


Re: [Spark Core] Does spark support read from remote Hive server via JDBC

2017-06-08 Thread Vadim Semenov
Have you tried running a query? something like:

```
test.select("*").limit(10).show()
```

On Thu, Jun 8, 2017 at 4:16 AM, Даша Ковальчук 
wrote:

> Hi guys,
>
> I need to execute hive queries on remote hive server from spark, but for
> some reasons i receive only column names(without data).
> Data available in table, I checked it via HUE and java jdbc connection.
>
> Here is my code example:
> val test = spark.read
> .option("url", "jdbc:hive2://remote.hive.server:1/work_base")
> .option("user", "user")
> .option("password", "password")
> .option("dbtable", "some_table_with_data")
> .option("driver", "org.apache.hive.jdbc.HiveDriver")
> .format("jdbc")
> .load()
> test.show()
>
>
> Scala version: 2.11
> Spark version: 2.1.0, i also tried 2.1.1
> Hive version: CDH 5.7 Hive 1.1.1
> Hive JDBC version: 1.1.1
>
> But this problem available on Hive with later versions, too.
> I didn't find anything in mail group answers and StackOverflow.
> Could you, please, help me with this issue or could you help me find correct
> solution how to query remote hive from spark?
>
> Thanks in advance!
>


Re: Worker node log not showed

2017-06-08 Thread Eike von Seggern
2017-05-31 10:48 GMT+02:00 Paolo Patierno :

> No it's running in standalone mode as Docker image on Kubernetes.
>
>
> The only way I found was to access "stderr" file created under the "work"
> directory in the SPARK_HOME but ... is it the right way ?
>

I think that is the right way. I haven't checked the documentation, but in a
standalone cluster you have a master node that manages your worker nodes,
each node running one "management" process. When you submit a job, these
management processes spawn "executor" processes whose stdout/stderr go to
$SPARK_HOME/work/…, but are not piped back through the management processes.
The logs should also be available through the web UI on port 8081 of the
worker.

Best

Eike


Re: Scala, Python or Java for Spark programming

2017-06-08 Thread JB Data
Java is an object language born to data; Python is a data language born to
objects, or something like that... Each one has its own uses.



@JBD 


2017-06-08 8:44 GMT+02:00 Jörn Franke :

> A slight advantage of Java is also the tooling that exist around it -
> better support by build tools and plugins, advanced static code analysis
> (security, bugs, performance) etc.
>
> On 8. Jun 2017, at 08:20, Mich Talebzadeh 
> wrote:
>
> What I like about Scala is that it is less ceremonial compared to Java.
> Java users claim that because Scala is built on Java, error tracking is very
> difficult. Also, Scala sits on top of Java, and that makes it virtually
> dependent on Java.
>
> For me the advantage of Scala is its simplicity and compactness. I can
> write Spark streaming code in Scala pretty fast, or import a massive RDBMS
> table into Hive, into a table of my design, equally fast using Scala.
>
> I don't know, maybe I cannot be bothered writing 100 lines of Java for a
> simple query from a table :)
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 8 June 2017 at 00:11, Matt Tenenbaum 
> wrote:
>
>> A lot depends on your context as well. If I'm using Spark _for analysis_,
>> I frequently use python; it's a starting point, from which I can then
>> leverage pandas, matplotlib/seaborn, and other powerful tools available on
>> top of python.
>>
>> If the Spark outputs are the ends themselves, rather than the means to
>> further exploration, Scala still feels like the "first class"
>> language---most thorough feature set, best debugging support, etc.
>>
>> More crudely: if the eventual goal is a dataset, I tend to prefer Scala;
>> if it's a visualization or some summary values, I tend to prefer Python.
>>
>> Of course, I also agree that this is more theological than technical.
>> Appropriately size your grains of salt.
>>
>> Cheers
>> -mt
>>
>> On Wed, Jun 7, 2017 at 12:39 PM, Bryan Jeffrey 
>> wrote:
>>
>>> Mich,
>>>
>>> We use Scala for a large project.  On our team we've set a few standards
>>> to ensure readability (we try to avoid excessive use of tuples, use named
>>> functions, etc.)  Given these constraints, I find Scala to be very
>>> readable, and far easier to use than Java.  The Lambda functionality of
>>> Java provides a lot of similar features, but the amount of typing required
>>> to set down a small function is excessive at best!
>>>
>>> Regards,
>>>
>>> Bryan Jeffrey
>>>
>>> On Wed, Jun 7, 2017 at 12:51 PM, Jörn Franke 
>>> wrote:
>>>
 I think this is a religious question ;-)
 Java is often underestimated, because people are not aware of its
 lambda functionality which makes the code very readable. Scala - it depends
 who programs it. People coming with the normal Java background write
 Java-like code in scala which might not be so good. People from a
 functional background write it more functional like - i.e. You have a lot
 of things in one line of code which can be a curse even for other
 functional programmers, especially if the application is distributed as in
 the case of Spark. Usually no comment is provided and you have - even as a
 functional programmer - to do a lot of drill down. Python is somehow
 similar, but since it has no connection with Java you do not have these
 extremes. There it depends more on the community (e.g. Medical, financials)
 and skills of people how the code look likes.
 However the difficulty comes with the distributed applications behind
 Spark which may have unforeseen side effects if the users do not know this,
 ie if they have never been used to parallel programming.

 On 7. Jun 2017, at 17:20, Mich Talebzadeh 
 wrote:


 Hi,

 I am a fan of Scala and functional programming hence I prefer Scala.

 I had a discussion with a hardcore Java programmer and a data scientist
 who prefers Python.

 Their view is that in a collaborative work using Scala programming it
 is almost impossible to understand someone else's Scala code.

 Hence I was wondering how much truth is there in this statement. Given
 that Spark uses Scala as its core development language, what is the general
 view on the use of Scala, Python or Java?

 Thanks,

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 

Output of select in non exponential form.

2017-06-08 Thread kundan kumar
predictions.select("prediction", "label", "features").show(5)


I have labels that are line numbers, but they are getting printed in exponential
format. Is there a way to print them in normal double notation?


Kundan
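
One way to avoid the scientific notation in show() output is to format the columns explicitly; a small sketch in Scala, reusing the predictions DataFrame and column names from the message above (the decimal(20,4) precision is an arbitrary choice):

import org.apache.spark.sql.functions.col

// Casting to a decimal type prints plain digits instead of scientific notation.
predictions
  .select(
    col("prediction").cast("decimal(20,4)").as("prediction"),
    col("label").cast("decimal(20,4)").as("label"),
    col("features"))
  .show(5, truncate = false)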


[Spark STREAMING]: Can not kill job gracefully on spark standalone cluster

2017-06-08 Thread Mariusz D.
There is a problem with killing jobs gracefully in Spark 2.1.0 with
spark.streaming.stopGracefullyOnShutdown enabled.
I tested killing Spark jobs in several ways and reached some conclusions.
1. With the command spark-submit --master spark:// --kill driver-id
results: it killed all workers almost immediately
2. With the API: curl -X POST http://localhost:6066/v1/submissions/kill/driverId
results: the same as in 1. (I looked at the code and it seems
spark-submit just calls this REST endpoint)
3. With a unix kill of the driver process
results: it didn't kill the job at all

Then I noticed that I had used the --supervise parameter, so I repeated all of
these tests. It turned out that methods 1 and 2 worked the same way as before,
but method 3 worked as I expected: after killing the driver process, the job
processed all remaining messages and then shut down gracefully. That is a
solution of sorts, but quite inconvenient, since I must track down the machine
running the driver instead of using the simple Spark REST endpoint.

I have read many issues posted on blogs and Stack Overflow, and a lot of
people struggle with this problem.

Can anybody explain why there are so many issues around this case, and why
methods 1 and 2 behave differently from a plain kill?

Kind Regards,
Mariusz Dubielecki
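
For context, the flag in question is an ordinary Spark configuration entry; a minimal sketch of enabling it when building the streaming context (the app name and batch interval are illustrative). The same setting can also be passed with spark-submit --conf:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("graceful-shutdown-example")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")

val ssc = new StreamingContext(conf, Seconds(10))
// define the input streams and output operations here, then:
// ssc.start()
// ssc.awaitTermination()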


Re: Performance issue when running Spark-1.6.1 in yarn-client mode with Hadoop 2.6.0

2017-06-08 Thread Satish John Bosco
I have tried the configuration calculator sheet provided by Cloudera as
well, but saw no improvement. For now, let's set the 17-million-record
operation aside.

Consider a simple sort on YARN versus Spark standalone, which shows a
tremendous difference.

The operation is simple: a selected numeric column sorted ascending. Below
are the results.

> 136 seconds - Yarn-client mode
> 40 seconds  - Spark Standalone mode

Can you guide me toward a simple yarn-site.xml configuration that would be
the bare minimum for the hardware below, so that I can see whether I am
missing or have overlooked any key configuration? Also, for Spark standalone
mode, how should spark-env.sh and spark-defaults be configured in terms of
how many instances to use, and with how much memory and how many cores?

32GB RAM, 8 Cores (16) and 1 TB HDD x 3 (1 Master and 2 Workers)

Finally, it is still a mystery why the setting
spark.sql.autoBroadcastJoinThreshold=-1 creates such a performance
difference in Spark 1.6.1.





On Wed, Jun 7, 2017 at 11:16 AM, Jörn Franke  wrote:

> What does your Spark job do? Have you tried standard configurations and
> changing them gradually?
>
> Have you checked the logfiles/ui which tasks  take long?
>
> 17 Mio records does not sound much, but it depends what you do with it.
>
> I do not think that for such a small "cluster" it makes sense to have a
> special scheduling configuration.
>
> > On 6. Jun 2017, at 18:02, satishjohn  wrote:
> >
> > Performance issue / time taken to complete spark job in yarn is 4 x
> slower,
> > when considered spark standalone mode. However, in spark standalone mode
> > jobs often fails with executor lost issue.
> >
> > Hardware configuration
> >
> >
> > 32GB RAM 8 Cores (16) and 1 TB HDD  3 (1 Master and 2 Workers)
> >
> > Spark configuration:
> >
> >
> > spark.executor.memory 7g
> > Spark cores Max 96
> > Spark driver 5GB
> > spark.sql.autoBroadcastJoinThreshold::-1 (Without this key the job
> fails or
> > job takes 50x times more time)
> > spark.driver.maxResultSize::2g
> > spark.driver.memory::5g
> > No of Instances 4 per machine.
> >
> > With the above spark configuration the spark job for the business flow
> of 17
> > million records completes in 8 Minutes.
> >
> > Problem Area:
> >
> >
> > When run in yarn client mode with the below configuration which takes 33
> to
> > 42 minutes to run the same flow. Below is the yarn-site.xml configuration
> > data.
> >
> > yarn.label.enabled = true
> > yarn.log-aggregation.enable-local-cleanup = false
> > yarn.resourcemanager.scheduler.client.thread-count = 64
> > yarn.resourcemanager.resource-tracker.address = satish-NS1:8031
> > yarn.resourcemanager.scheduler.address = satish-NS1:8030
> > yarn.dispatcher.exit-on-error = true
> > yarn.nodemanager.container-manager.thread-count = 64
> > yarn.nodemanager.local-dirs = /home/satish/yarn
> > yarn.nodemanager.localizer.fetch.thread-count = 20
> > yarn.resourcemanager.address = satish-NS1:8032
> > yarn.scheduler.increment-allocation-mb = 512
> > yarn.log.server.url = http://satish-NS1:19888/jobhistory/logs
> > yarn.nodemanager.resource.memory-mb = 28000
> > yarn.nodemanager.labels = MASTER
> > yarn.nodemanager.resource.cpu-vcores = 48
> > yarn.scheduler.minimum-allocation-mb = 1024
> > yarn.log-aggregation-enable = true
> > yarn.nodemanager.localizer.client.thread-count = 20
> > yarn.app.mapreduce.am.labels = CORE
> > yarn.log-aggregation.retain-seconds = 172800
> > yarn.nodemanager.address = ${yarn.nodemanager.hostname}:8041
> > yarn.resourcemanager.hostname = satish-NS1
> > yarn.scheduler.maximum-allocation-mb = 8192
> > yarn.nodemanager.remote-app-log-dir = /home/satish/satish/hadoop-yarn/apps
> > yarn.resourcemanager.resource-tracker.client.thread-count = 64
> > yarn.scheduler.maximum-allocation-vcores = 1
> > yarn.nodemanager.aux-services = mapreduce_shuffle
> > yarn.nodemanager.aux-services.mapreduce_shuffle.class = org.apache.hadoop.mapred.ShuffleHandler
> > yarn.resourcemanager.client.thread-count = 64
> > yarn.nodemanager.container-metrics.enable = true
> > yarn.nodemanager.log-dirs = /home/satish/hadoop-yarn/containers
> > yarn.nodemanager.aux-services = spark_shuffle,mapreduce_shuffle
> > yarn.nodemanager.aux-services.mapreduce.shuffle.class = org.apache.hadoop.mapred.ShuffleHandler
> > yarn.nodemanager.aux-services.spark_shuffle.class = org.apache.spark.network.yarn.YarnShuffleService
> > yarn.scheduler.minimum-allocation-vcores = 1
> > yarn.scheduler.increment-allocation-vcores = 1
> > yarn.resourcemanager.scheduler.class = org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
> > yarn.scheduler.fair.preemption = true
> >
> > Also in capacity scheduler I am using Dominant resource calculator. I
> have
> > tried hands on other fair and default

Re: [CSV] If number of columns of one row bigger than maxcolumns it stop the whole parsing process.

2017-06-08 Thread Chanh Le
Can you recommend one?

Thanks.

On Thu, Jun 8, 2017 at 2:47 PM Jörn Franke  wrote:

> You can change the CSV parser library
>
> On 8. Jun 2017, at 08:35, Chanh Le  wrote:
>
>
> I did add mode -> DROPMALFORMED but it still couldn't ignore it because
> the error raise from the CSV library that Spark are using.
>
>
> On Thu, Jun 8, 2017 at 12:11 PM Jörn Franke  wrote:
>
>> The CSV data source allows you to skip invalid lines - this should also
>> include lines that have more than maxColumns. Choose mode "DROPMALFORMED"
>>
>> On 8. Jun 2017, at 03:04, Chanh Le  wrote:
>>
>> Hi Takeshi, Jörn Franke,
>>
>> The problem is even I increase the maxColumns it still have some lines
>> have larger columns than the one I set and it will cost a lot of memory.
>> So I just wanna skip the line has larger columns than the maxColumns I
>> set.
>>
>> Regards,
>> Chanh
>>
>>
>> On Thu, Jun 8, 2017 at 12:48 AM Takeshi Yamamuro 
>> wrote:
>>
>>> Is it not enough to set `maxColumns` in CSV options?
>>>
>>>
>>> https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116
>>>
>>> // maropu
>>>
>>> On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke 
>>> wrote:
>>>
 Spark CSV data source should be able

 On 7. Jun 2017, at 17:50, Chanh Le  wrote:

 Hi everyone,
 I am using Spark 2.1.1 to read csv files and convert to avro files.
 One problem that I am facing is if one row of csv file has more columns
 than maxColumns (default is 20480). The process of parsing was stop.

 Internal state when error was thrown: line=1, column=3, record=0,
 charIndex=12
 com.univocity.parsers.common.TextParsingException:
 java.lang.ArrayIndexOutOfBoundsException - 2
 Hint: Number of columns processed may have exceeded limit of 2 columns.
 Use settings.setMaxColumns(int) to define the maximum number of columns
 your input can have
 Ensure your configuration is correct, with delimiters, quotes and
 escape sequences that match the input format you are trying to parse
 Parser Configuration: CsvParserSettings:


 I did some investigation in univocity
  library but the way
 it handle is throw error that why spark stop the process.

 How to skip the invalid row and just continue to parse next valid one?
 Any libs can replace univocity in that job?

 Thanks & regards,
 Chanh
 --
 Regards,
 Chanh


>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>> --
>> Regards,
>> Chanh
>>
>> --
> Regards,
> Chanh
>
> --
Regards,
Chanh


[Spark Core] Does spark support read from remote Hive server via JDBC

2017-06-08 Thread Даша Ковальчук
Hi guys,

I need to execute Hive queries on a remote Hive server from Spark, but for
some reason I receive only column names (without data).
The data is available in the table; I checked it via HUE and a Java JDBC connection.

Here is my code example:
val test = spark.read
.option("url", "jdbc:hive2://remote.hive.server:1/work_base")
.option("user", "user")
.option("password", "password")
.option("dbtable", "some_table_with_data")
.option("driver", "org.apache.hive.jdbc.HiveDriver")
.format("jdbc")
.load()
test.show()


Scala version: 2.11
Spark version: 2.1.0, i also tried 2.1.1
Hive version: CDH 5.7 Hive 1.1.1
Hive JDBC version: 1.1.1

This problem occurs with later Hive versions, too.
I didn't find anything in mailing list answers or on Stack Overflow.
Could you please help me with this issue, or help me find the correct
way to query a remote Hive server from Spark?

Thanks in advance!


Re: [Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-08 Thread Patrik Medvedev
Hello guys,

Can somebody help me with my problem?
Let me know, if you need more details.


ср, 7 июн. 2017 г. в 16:43, Patrik Medvedev :

> No, I don't.
>
> ср, 7 июн. 2017 г. в 16:42, Jean Georges Perrin :
>
>> Do you have some other security in place like Kerberos or impersonation?
>> It may affect your access.
>>
>>
>> jg
>>
>>
>> On Jun 7, 2017, at 02:15, Patrik Medvedev 
>> wrote:
>>
>> Hello guys,
>>
>> I need to execute hive queries on remote hive server from spark, but for
>> some reasons i receive only column names(without data).
>> Data available in table, i checked it via HUE and java jdbc connection.
>>
>> Here is my code example:
>> val test = spark.read
>> .option("url", "jdbc:hive2://remote.hive.server:1/work_base")
>> .option("user", "user")
>> .option("password", "password")
>> .option("dbtable", "some_table_with_data")
>> .option("driver", "org.apache.hive.jdbc.HiveDriver")
>> .format("jdbc")
>> .load()
>> test.show()
>>
>>
>> Scala version: 2.11
>> Spark version: 2.1.0, i also tried 2.1.1
>> Hive version: CDH 5.7 Hive 1.1.1
>> Hive JDBC version: 1.1.1
>>
>> But this problem available on Hive with later versions, too.
>> Could you help me with this issue, because i didn't find anything in mail
>> group answers and StackOverflow.
>> Or could you help me find correct solution how to query remote hive from
>> spark?
>>
>> --
>> *Cheers,*
>> *Patrick*
>>
>>


Re: [CSV] If number of columns of one row bigger than maxcolumns it stop the whole parsing process.

2017-06-08 Thread Jörn Franke
You can change the CSV parser library 

> On 8. Jun 2017, at 08:35, Chanh Le  wrote:
> 
> 
> I did add mode -> DROPMALFORMED but it still couldn't ignore it because the 
> error raise from the CSV library that Spark are using.
> 
> 
>> On Thu, Jun 8, 2017 at 12:11 PM Jörn Franke  wrote:
>> The CSV data source allows you to skip invalid lines - this should also 
>> include lines that have more than maxColumns. Choose mode "DROPMALFORMED"
>> 
>>> On 8. Jun 2017, at 03:04, Chanh Le  wrote:
>>> 
>>> Hi Takeshi, Jörn Franke,
>>> 
>>> The problem is even I increase the maxColumns it still have some lines have 
>>> larger columns than the one I set and it will cost a lot of memory.
>>> So I just wanna skip the line has larger columns than the maxColumns I set.
>>> 
>>> Regards,
>>> Chanh
>>> 
>>> 
 On Thu, Jun 8, 2017 at 12:48 AM Takeshi Yamamuro  
 wrote:
 Is it not enough to set `maxColumns` in CSV options?
 
 https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116
 
 // maropu
 
> On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke  wrote:
> Spark CSV data source should be able
> 
>> On 7. Jun 2017, at 17:50, Chanh Le  wrote:
>> 
>> Hi everyone,
>> I am using Spark 2.1.1 to read csv files and convert to avro files.
>> One problem that I am facing is if one row of csv file has more columns 
>> than maxColumns (default is 20480). The process of parsing was stop.
>> 
>> Internal state when error was thrown: line=1, column=3, record=0, 
>> charIndex=12
>> com.univocity.parsers.common.TextParsingException: 
>> java.lang.ArrayIndexOutOfBoundsException - 2
>> Hint: Number of columns processed may have exceeded limit of 2 columns. 
>> Use settings.setMaxColumns(int) to define the maximum number of columns 
>> your input can have
>> Ensure your configuration is correct, with delimiters, quotes and escape 
>> sequences that match the input format you are trying to parse
>> Parser Configuration: CsvParserSettings:
>> 
>> 
>> I did some investigation in univocity library but the way it handle is 
>> throw error that why spark stop the process.
>> 
>> How to skip the invalid row and just continue to parse next valid one?
>> Any libs can replace univocity in that job?
>> 
>> Thanks & regards,
>> Chanh
>> -- 
>> Regards,
>> Chanh
 
 
 
 -- 
 ---
 Takeshi Yamamuro
>>> 
>>> -- 
>>> Regards,
>>> Chanh
> 
> -- 
> Regards,
> Chanh