SPARK-19547

2017-06-07 Thread Rastogi, Pankaj
Hi,
 I have been trying to distribute Kafka topics among different instances of the
same consumer group. I am using the KafkaDirectStream API for creating DStreams.
After the second consumer instance comes up, Kafka rebalances the partitions and
then the Spark driver of the first consumer dies with the following exception:

java.lang.IllegalStateException: No current assignment for partition myTopic_5-0
at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:264)
at org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:336)
at org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1236)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
at org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
at org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:42)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
at org.apache.spark.streaming.dstream.TransformedDStream.createRDDWithLocalProperties(TransformedDStream.scala:65)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
at 

Read Data From NFS

2017-06-07 Thread ayan guha
Hi Guys

Quick one: how does Spark deal with (i.e. create partitions for) large files
sitting on NFS, assuming all the executors can see the file in exactly the same
way?

i.e., when I run

r = sc.textFile("file://my/file")

what happens if the file is on NFS?

is there any difference from

r = sc.textFile("hdfs://my/file")

Are the same input formats used in both cases?

-- 
Best Regards,
Ayan Guha


Re: [CSV] If number of columns of one row bigger than maxcolumns it stop the whole parsing process.

2017-06-07 Thread Jörn Franke
The CSV data source allows you to skip invalid lines - this should also include 
lines that have more columns than maxColumns. Choose mode "DROPMALFORMED".
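
For reference, a minimal sketch of that setting with Spark 2.1's built-in CSV
reader (the path is a placeholder):

val df = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")  // drop rows the parser cannot handle instead of failing the job
  .option("maxColumns", "20480")    // per-row column limit passed to the underlying univocity parser
  .csv("/path/to/input/*.csv")
df.show()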

> On 8. Jun 2017, at 03:04, Chanh Le  wrote:
> 
> Hi Takeshi, Jörn Franke,
> 
> The problem is even I increase the maxColumns it still have some lines have 
> larger columns than the one I set and it will cost a lot of memory.
> So I just wanna skip the line has larger columns than the maxColumns I set.
> 
> Regards,
> Chanh
> 
> 
>> On Thu, Jun 8, 2017 at 12:48 AM Takeshi Yamamuro  
>> wrote:
>> Is it not enough to set `maxColumns` in CSV options?
>> 
>> https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116
>> 
>> // maropu
>> 
>>> On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke  wrote:
>>> Spark CSV data source should be able
>>> 
 On 7. Jun 2017, at 17:50, Chanh Le  wrote:
 
 Hi everyone,
 I am using Spark 2.1.1 to read csv files and convert to avro files.
 One problem that I am facing is if one row of csv file has more columns 
 than maxColumns (default is 20480). The process of parsing was stop.
 
 Internal state when error was thrown: line=1, column=3, record=0, 
 charIndex=12
 com.univocity.parsers.common.TextParsingException: 
 java.lang.ArrayIndexOutOfBoundsException - 2
 Hint: Number of columns processed may have exceeded limit of 2 columns. 
 Use settings.setMaxColumns(int) to define the maximum number of columns 
 your input can have
 Ensure your configuration is correct, with delimiters, quotes and escape 
 sequences that match the input format you are trying to parse
 Parser Configuration: CsvParserSettings:
 
 
 I did some investigation in univocity library but the way it handle is 
 throw error that why spark stop the process.
 
 How to skip the invalid row and just continue to parse next valid one?
 Any libs can replace univocity in that job?
 
 Thanks & regards,
 Chanh
 -- 
 Regards,
 Chanh
>> 
>> 
>> 
>> -- 
>> ---
>> Takeshi Yamamuro
> 
> -- 
> Regards,
> Chanh


Re: No TypeTag Available for String

2017-06-07 Thread Ryan
Did you include the proper scala-reflect dependency?
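
For example, with sbt that would be something like this (a sketch - match the
version to your Scala version):

libraryDependencies += "org.scala-lang" % "scala-reflect" % "2.11.8"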

On Wed, May 31, 2017 at 1:01 AM, krishmah  wrote:

> I am currently using Spark 2.0.1 with Scala 2.11.8. However same code works
> with Scala 2.10.6. Please advise if I am missing something
>
> import org.apache.spark.sql.functions.udf
>
> val getFileName = udf{z:String => z.takeRight(z.length
> -z.lastIndexOf("/")-1)}
>
> and this gives me following error messages
>
> No Type Tag Available for String and
>
> not enough arguments for method udf: (implicit evidence$2:
> reflect.runtime.universe.TypeTag[String], implicit evidence$3:
> reflect.runtime.universe.TypeTag[String])org.apache.spark.sql.expressions.
> UserDefinedFunction.
> Unspecified value parameters evidence$2, evidence$3.
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/No-TypeTag-Available-for-String-tp28720.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Worker node log not showed

2017-06-07 Thread Ryan
I think you need to get the logger within the lambda; otherwise it's the
logger on the driver side, which won't work.
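
Something along these lines (a rough sketch using log4j 1.x; the logger name
and message are illustrative):

rdd.foreachPartition { iter =>
  // obtain the logger inside the closure so it is created on the executor,
  // not captured (and serialized) from the driver
  val log = org.apache.log4j.Logger.getLogger("worker-side")
  iter.foreach(value => log.info("*** rdd.foreach " + value))
}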

On Wed, May 31, 2017 at 4:48 PM, Paolo Patierno  wrote:

> No it's running in standalone mode as Docker image on Kubernetes.
>
>
> The only way I found was to access "stderr" file created under the "work"
> directory in the SPARK_HOME but ... is it the right way ?
>
>
> *Paolo Patierno*
>
> *Senior Software Engineer (IoT) @ Red Hat **Microsoft MVP on **Windows
> Embedded & IoT*
> *Microsoft Azure Advisor*
>
> Twitter : @ppatierno 
> Linkedin : paolopatierno 
> Blog : DevExperience 
>
>
> --
> *From:* Alonso Isidoro Roman 
> *Sent:* Wednesday, May 31, 2017 8:39 AM
> *To:* Paolo Patierno
> *Cc:* user@spark.apache.org
> *Subject:* Re: Worker node log not showed
>
> Are you running the code with yarn?
>
> if so, figure out the applicationID through the web ui, then run the next
> command:
>
> yarn logs your_application_id
>
> Alonso Isidoro Roman
> [image: https://]about.me/alonso.isidoro.roman
>
> 
>
> 2017-05-31 9:42 GMT+02:00 Paolo Patierno :
>
>> Hi all,
>>
>>
>> I have a simple cluster with one master and one worker. On another
>> machine I launch the driver where at some point I have following line of
>> codes :
>>
>>
>> max.foreachRDD(rdd -> {
>>
>> LOG.info("*** max.foreachRDD");
>>
>> rdd.foreach(value -> {
>>
>> LOG.info("*** rdd.foreach");
>>
>> });
>> });
>>
>>
>> The message "*** max.foreachRDD" is visible in the console of the driver
>> machine ... and it's ok.
>> I can't see the "*** rdd.foreach" message that should be executed on the
>> worker node right ? Btw on the worker node console I can't see it. Why ?
>>
>> My need is to log what happens in the code executed on worker node
>> (because it works if I execute it on master local[*] but not when submitted
>> to a worker node).
>>
>> Following the log4j.properties file I put in the /conf dir :
>>
>> # Set everything to be logged to the console
>> log4j.rootCategory=INFO, console
>> log4j.appender.console=org.apache.log4j.ConsoleAppender
>> log4j.appender.console.target=System.err
>> log4j.appender.console.layout=org.apache.log4j.PatternLayout
>> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p 
>> %c{1}: %m%n
>>
>> # Settings to quiet third party logs that are too verbose
>> log4j.logger.org.spark-project.jetty=WARN
>> log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
>> log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
>> log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
>>
>>
>>
>> Thanks
>> Paolo.
>>
>>
>> *Paolo Patierno*
>>
>> *Senior Software Engineer (IoT) @ Red Hat **Microsoft MVP on **Windows
>> Embedded & IoT*
>> *Microsoft Azure Advisor*
>>
>> Twitter : @ppatierno 
>> Linkedin : paolopatierno 
>> Blog : DevExperience 
>>
>
>


Re: good http sync client to be used with spark

2017-06-07 Thread Ryan
We use AsyncHttpClient (from the Java world) and simply call future.get() to
make the call synchronous.
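
Roughly like this (a sketch assuming the org.asynchttpclient 2.x artifact is on
the classpath; the URL and payload handling are placeholders):

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val client = org.asynchttpclient.Dsl.asyncHttpClient()  // one client (and connection pool) per partition
    try {
      records.foreach { rec =>
        val response = client.preparePost("http://example.com/ingest")
          .setBody(rec.toString)
          .execute()  // returns a ListenableFuture[Response]
          .get()      // block on the future to make the call synchronous
        // inspect response.getStatusCode here if needed
      }
    } finally client.close()
  }
}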

On Thu, Jun 1, 2017 at 4:08 AM, vimal dinakaran  wrote:

> Hi,
>  In our application pipeline we need to push the data from spark streaming
> to a http server.
>
> I would like to have a http client with below requirements.
>
> 1. synchronous calls
> 2. Http connection pool support
> 3. light weight and easy to use.
>
> spray,akka http are mostly suited for async call . Correct me if I am
> wrong.
>
> Could you please let me know what is the client that suits the above ?
>


Re: Question about mllib.recommendation.ALS

2017-06-07 Thread Ryan
1. Could you share the job, stage & task status from the Spark UI? I find it
extremely useful for performance tuning.

2. Use model.transform for predictions. Usually we have a pipeline for
preparing the training data, and running the same pipeline over the data you
want to predict gives you the prediction column.
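
A rough sketch of the ml-side flow (column names and DataFrames are
placeholders):

import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setRank(20)
  .setMaxIter(10)
  .setUserCol("userId")
  .setItemCol("productId")
  .setRatingCol("rating")

val model = als.fit(trainingDF)            // trainingDF holds the three columns above
val predictions = model.transform(testDF)  // adds a "prediction" column to testDF
predictions.show(5)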

On Thu, Jun 1, 2017 at 7:48 AM, Sahib Aulakh [Search] ­ <
sahibaul...@coupang.com> wrote:

> Hello:
>
> I am training the ALS model for recommendations. I have about 200m ratings
> from about 10m users and 3m products. I have a small cluster with 48 cores
> and 120gb cluster-wide memory.
>
> My code is very similar to the example code
>
> spark/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala
> code.
>
> I have a couple of questions:
>
>
>1. All steps up to model training runs reasonably fast. Model training
>is under 10 minutes for rank 20. However, the 
> model.recommendProductsForUsers
>step is either slow or just does not work as the code just seems to hang at
>this point. I have tried user and product blocks sizes of -1 and 20, 40,
>etc, played with executor memory size, etc. Can someone shed some light
>here as to what could be wrong?
>2. Also, is there any example code for the ml.recommendation.ALS
>algorithm? I can figure out how to train the model but I don't understand
>(from the documentation) how to perform predictions?
>
> Thanks for any information you can provide.
> Sahib Aulakh.
>
>
> --
> Sahib Aulakh
> Sr. Principal Engineer
>


Re: Java SPI jar reload in Spark

2017-06-07 Thread Ryan
I'd suggest scripts like JS, Groovy, etc. To my understanding, the service
loader mechanism isn't a good fit for runtime reloading.
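
For the script route, a minimal sketch with the JDK's javax.script API (the
engine name and script path are placeholders; with Groovy you would use its
script engine instead):

import javax.script.ScriptEngineManager

val engine = new ScriptEngineManager().getEngineByName("nashorn")
// re-read the script text whenever your reload event fires
val scriptSource = scala.io.Source.fromFile("/path/to/business-rule.js").mkString
val result = engine.eval(scriptSource)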

On Wed, Jun 7, 2017 at 4:55 PM, Jonnas Li(Contractor) <
zhongshuang...@envisioncn.com> wrote:

> To be more explicit, I used mapwithState() in my application, just like
> this:
>
> stream = KafkaUtils.createStream(..)
> mappedStream = stream.mapPartitionToPair(..)
> stateStream = mappedStream.mapwithState(*MyUpdateFunc*(..))
> stateStream.foreachRDD(..)
>
> I call the jar in *MyUpdateFunc()*, and the jar reloading is triggered by
> some other event.
>
> I'm not sure if this approach is feasible. To my understand, Spark will
> checkpoint the status, so the application can’t be updated at runtime,
> that’s why I got the exception.
>
> Any suggestion is welcome, if there is any other idea to do something like
> this, I just want to provide a approach to enable users can customize for
> their business logic.
>
> Regards
> 李忠双 / Jonnas
>
> From: Zhongshuang Li 
> Date: Tuesday, 6 June 2017, 6:30 PM
> To: Alonso Isidoro Roman 
>
> Cc: Jörn Franke , "user@spark.apache.org" <
> user@spark.apache.org>
> Subject: Re: Java SPI jar reload in Spark
>
> I used java.util.ServiceLoader
>   ,
> as the javadoc says it supports reloading.
> Please point it out if I'm mis-understanding.
>
> Regards
> Jonnas Li
>
> From: Alonso Isidoro Roman 
> Date: Tuesday, 6 June 2017, 6:21 PM
> To: Zhongshuang Li 
> Cc: Jörn Franke , "user@spark.apache.org" <
> user@spark.apache.org>
> Subject: Re: Java SPI jar reload in Spark
>
> Hi, a quick search on google.
>
> https://github.com/spark-jobserver/spark-jobserver/issues/130
>
> Alonso Isidoro Roman
> [image: https://]about.me/alonso.isidoro.roman
>
> 
>
> 2017-06-06 12:14 GMT+02:00 Jonnas Li(Contractor) <
> zhongshuang...@envisioncn.com>:
>
>> Thank for your quick response.
>> These jars are used to define some customize business logic, and they can
>> be treat as plug-ins in our business scenario. And the jars are
>> developed/maintain by some third-party partners, this means there will be
>> some version updating.
>> My expectation is update the business code with restarting the spark
>> streaming job, any suggestion will be very grateful.
>>
>> Regards
>> Jonnas Li
>>
>> From: Jörn Franke 
>> Date: Tuesday, 6 June 2017, 5:55 PM
>> To: Zhongshuang Li 
>> Cc: "user@spark.apache.org" 
>> Subject: Re: Java SPI jar reload in Spark
>>
>> Why do you need jar reloading? What functionality is executed during jar
>> reloading. Maybe there is another way to achieve the same without jar
>> reloading. In fact, it might be dangerous from a functional point of view-
>> functionality in jar changed and all your computation is wrong.
>>
>> On 6. Jun 2017, at 11:35, Jonnas Li(Contractor) <
>> zhongshuang...@envisioncn.com> wrote:
>>
>> I have a Spark Streaming application, which dynamically calling a jar
>> (Java SPI), and the jar is called in a mapWithState() function, it was
>> working fine for a long time.
>> Recently, I got a requirement which required to reload the jar during
>> runtime.
>> But when the reloading is completed, the spark streaming job got failed,
>> and I get the following exception, it seems the spark try to deserialize
>> the checkpoint failed.
>> My question is whether the logic in the jar will be serialized into
>> checkpoint, and is it possible to do the jar reloading during runtime in
>> Spark Streaming?
>>
>>
>> [2017-06-06 17:13:12,185] WARN Lost task 1.0 in stage 5355.0 (TID 4817, 
>> ip-10-21-14-205.envisioncn.com): java.lang.ClassCastException: cannot assign 
>> instance of scala.collection.immutable.List$SerializationProxy to field 
>> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
>> scala.collection.Seq in instance of 
>> org.apache.spark.streaming.rdd.MapWithStateRDD
>>  at 
>> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
>>  at 
>> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
>>  at 
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
>>  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>>  at 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>  at 
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>>  at 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>  at 

Re: Convert the feature vector to raw data

2017-06-07 Thread Ryan
If you used StringIndexer to index the categorical data, IndexToString can
convert it back.
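
For example (a sketch; column names are placeholders):

import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

val indexerModel = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(rawDF)

val converter = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryOriginal")
  .setLabels(indexerModel.labels)  // reuse the fitted labels to map indices back

val restored = converter.transform(indexerModel.transform(rawDF))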

On Wed, Jun 7, 2017 at 6:14 PM, kundan kumar  wrote:

> Hi Yan,
>
> This doesnt work.
>
> thanks,
> kundan
>
> On Wed, Jun 7, 2017 at 2:53 PM, 颜发才(Yan Facai) 
> wrote:
>
>> Hi, kumar.
>>
>> How about removing the `select` in your code?
>> namely,
>>
>> Dataset result = model.transform(testData);
>> result.show(1000, false);
>>
>>
>>
>>
>> On Wed, Jun 7, 2017 at 5:00 PM, kundan kumar 
>> wrote:
>>
>>> I am using
>>>
>>> Dataset result = model.transform(testData).select("probability",
>>> "label","features");
>>>  result.show(1000, false);
>>>
>>> In this case the feature vector is being printed as output. Is there a
>>> way that my original raw data gets printed instead of the feature vector OR
>>> is there a way to reverse extract my raw data from the feature vector. All
>>> of the features that my dataset have is categorical in nature.
>>>
>>> Thanks,
>>> Kundan
>>>
>>
>>
>


Re: [CSV] If number of columns of one row bigger than maxcolumns it stop the whole parsing process.

2017-06-07 Thread Chanh Le
Hi Takeshi, Jörn Franke,

The problem is that even if I increase maxColumns, some lines still have
more columns than the limit I set, and raising it costs a lot of memory.
So I just want to skip any line that has more columns than the maxColumns I set.

Regards,
Chanh


On Thu, Jun 8, 2017 at 12:48 AM Takeshi Yamamuro 
wrote:

> Is it not enough to set `maxColumns` in CSV options?
>
>
> https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116
>
> // maropu
>
> On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke  wrote:
>
>> Spark CSV data source should be able
>>
>> On 7. Jun 2017, at 17:50, Chanh Le  wrote:
>>
>> Hi everyone,
>> I am using Spark 2.1.1 to read csv files and convert to avro files.
>> One problem that I am facing is if one row of csv file has more columns
>> than maxColumns (default is 20480). The process of parsing was stop.
>>
>> Internal state when error was thrown: line=1, column=3, record=0,
>> charIndex=12
>> com.univocity.parsers.common.TextParsingException:
>> java.lang.ArrayIndexOutOfBoundsException - 2
>> Hint: Number of columns processed may have exceeded limit of 2 columns.
>> Use settings.setMaxColumns(int) to define the maximum number of columns
>> your input can have
>> Ensure your configuration is correct, with delimiters, quotes and escape
>> sequences that match the input format you are trying to parse
>> Parser Configuration: CsvParserSettings:
>>
>>
>> I did some investigation in univocity
>>  library but the way it
>> handle is throw error that why spark stop the process.
>>
>> How to skip the invalid row and just continue to parse next valid one?
>> Any libs can replace univocity in that job?
>>
>> Thanks & regards,
>> Chanh
>> --
>> Regards,
>> Chanh
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>
-- 
Regards,
Chanh


Re: Scala, Python or Java for Spark programming

2017-06-07 Thread Matt Tenenbaum
A lot depends on your context as well. If I'm using Spark _for analysis_, I
frequently use python; it's a starting point, from which I can then
leverage pandas, matplotlib/seaborn, and other powerful tools available on
top of python.

If the Spark outputs are the ends themselves, rather than the means to
further exploration, Scala still feels like the "first class"
language---most thorough feature set, best debugging support, etc.

More crudely: if the eventual goal is a dataset, I tend to prefer Scala; if
it's a visualization or some summary values, I tend to prefer Python.

Of course, I also agree that this is more theological than technical.
Appropriately size your grains of salt.

Cheers
-mt

On Wed, Jun 7, 2017 at 12:39 PM, Bryan Jeffrey 
wrote:

> Mich,
>
> We use Scala for a large project.  On our team we've set a few standards
> to ensure readability (we try to avoid excessive use of tuples, use named
> functions, etc.)  Given these constraints, I find Scala to be very
> readable, and far easier to use than Java.  The Lambda functionality of
> Java provides a lot of similar features, but the amount of typing required
> to set down a small function is excessive at best!
>
> Regards,
>
> Bryan Jeffrey
>
> On Wed, Jun 7, 2017 at 12:51 PM, Jörn Franke  wrote:
>
>> I think this is a religious question ;-)
>> Java is often underestimated, because people are not aware of its lambda
>> functionality which makes the code very readable. Scala - it depends who
>> programs it. People coming with the normal Java background write Java-like
>> code in scala which might not be so good. People from a functional
>> background write it more functional like - i.e. You have a lot of things in
>> one line of code which can be a curse even for other functional
>> programmers, especially if the application is distributed as in the case of
>> Spark. Usually no comment is provided and you have - even as a functional
>> programmer - to do a lot of drill down. Python is somehow similar, but
>> since it has no connection with Java you do not have these extremes. There
>> it depends more on the community (e.g. Medical, financials) and skills of
>> people how the code look likes.
>> However the difficulty comes with the distributed applications behind
>> Spark which may have unforeseen side effects if the users do not know this,
>> ie if they have never been used to parallel programming.
>>
>> On 7. Jun 2017, at 17:20, Mich Talebzadeh 
>> wrote:
>>
>>
>> Hi,
>>
>> I am a fan of Scala and functional programming hence I prefer Scala.
>>
>> I had a discussion with a hardcore Java programmer and a data scientist
>> who prefers Python.
>>
>> Their view is that in a collaborative work using Scala programming it is
>> almost impossible to understand someone else's Scala code.
>>
>> Hence I was wondering how much truth is there in this statement. Given
>> that Spark uses Scala as its core development language, what is the general
>> view on the use of Scala, Python or Java?
>>
>> Thanks,
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>


Re: Scala, Python or Java for Spark programming

2017-06-07 Thread Bryan Jeffrey
Mich,

We use Scala for a large project.  On our team we've set a few standards to
ensure readability (we try to avoid excessive use of tuples, use named
functions, etc.)  Given these constraints, I find Scala to be very
readable, and far easier to use than Java.  The Lambda functionality of
Java provides a lot of similar features, but the amount of typing required
to set down a small function is excessive at best!

Regards,

Bryan Jeffrey

On Wed, Jun 7, 2017 at 12:51 PM, Jörn Franke  wrote:

> I think this is a religious question ;-)
> Java is often underestimated, because people are not aware of its lambda
> functionality which makes the code very readable. Scala - it depends who
> programs it. People coming with the normal Java background write Java-like
> code in scala which might not be so good. People from a functional
> background write it more functional like - i.e. You have a lot of things in
> one line of code which can be a curse even for other functional
> programmers, especially if the application is distributed as in the case of
> Spark. Usually no comment is provided and you have - even as a functional
> programmer - to do a lot of drill down. Python is somehow similar, but
> since it has no connection with Java you do not have these extremes. There
> it depends more on the community (e.g. Medical, financials) and skills of
> people how the code look likes.
> However the difficulty comes with the distributed applications behind
> Spark which may have unforeseen side effects if the users do not know this,
> ie if they have never been used to parallel programming.
>
> On 7. Jun 2017, at 17:20, Mich Talebzadeh 
> wrote:
>
>
> Hi,
>
> I am a fan of Scala and functional programming hence I prefer Scala.
>
> I had a discussion with a hardcore Java programmer and a data scientist
> who prefers Python.
>
> Their view is that in a collaborative work using Scala programming it is
> almost impossible to understand someone else's Scala code.
>
> Hence I was wondering how much truth is there in this statement. Given
> that Spark uses Scala as its core development language, what is the general
> view on the use of Scala, Python or Java?
>
> Thanks,
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>


Re: [CSV] If number of columns of one row bigger than maxcolumns it stop the whole parsing process.

2017-06-07 Thread Takeshi Yamamuro
Is it not enough to set `maxColumns` in CSV options?

https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116

// maropu

On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke  wrote:

> Spark CSV data source should be able
>
> On 7. Jun 2017, at 17:50, Chanh Le  wrote:
>
> Hi everyone,
> I am using Spark 2.1.1 to read csv files and convert to avro files.
> One problem that I am facing is if one row of csv file has more columns
> than maxColumns (default is 20480). The process of parsing was stop.
>
> Internal state when error was thrown: line=1, column=3, record=0,
> charIndex=12
> com.univocity.parsers.common.TextParsingException: 
> java.lang.ArrayIndexOutOfBoundsException
> - 2
> Hint: Number of columns processed may have exceeded limit of 2 columns.
> Use settings.setMaxColumns(int) to define the maximum number of columns
> your input can have
> Ensure your configuration is correct, with delimiters, quotes and escape
> sequences that match the input format you are trying to parse
> Parser Configuration: CsvParserSettings:
>
>
> I did some investigation in univocity
>  library but the way it
> handle is throw error that why spark stop the process.
>
> How to skip the invalid row and just continue to parse next valid one?
> Any libs can replace univocity in that job?
>
> Thanks & regards,
> Chanh
> --
> Regards,
> Chanh
>
>


-- 
---
Takeshi Yamamuro


Re: Scala, Python or Java for Spark programming

2017-06-07 Thread Jörn Franke
I think this is a religious question ;-) 
Java is often underestimated, because people are not aware of its lambda 
functionality, which makes the code very readable. Scala - it depends on who 
programs it. People coming from the usual Java background write Java-like code 
in Scala, which might not be so good. People from a functional background write 
it in a more functional style - i.e. you have a lot of things in one line of 
code, which can be a curse even for other functional programmers, especially if 
the application is distributed as in the case of Spark. Usually no comment is 
provided and you have - even as a functional programmer - to do a lot of drill 
down. Python is somewhat similar, but since it has no connection with Java you 
do not have these extremes. There it depends more on the community (e.g. 
medical, financials) and the skills of the people what the code looks like.
However, the difficulty comes with the distributed applications behind Spark, 
which may have unforeseen side effects if the users do not know this, i.e. if 
they have never been exposed to parallel programming.

> On 7. Jun 2017, at 17:20, Mich Talebzadeh  wrote:
> 
> 
> Hi,
> 
> I am a fan of Scala and functional programming hence I prefer Scala.
> 
> I had a discussion with a hardcore Java programmer and a data scientist who 
> prefers Python.
> 
> Their view is that in a collaborative work using Scala programming it is 
> almost impossible to understand someone else's Scala code.
> 
> Hence I was wondering how much truth is there in this statement. Given that 
> Spark uses Scala as its core development language, what is the general view 
> on the use of Scala, Python or Java?
> 
> Thanks,
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  


Re: [CSV] If number of columns of one row bigger than maxcolumns it stop the whole parsing process.

2017-06-07 Thread Jörn Franke
Spark CSV data source should be able

> On 7. Jun 2017, at 17:50, Chanh Le  wrote:
> 
> Hi everyone,
> I am using Spark 2.1.1 to read csv files and convert to avro files.
> One problem that I am facing is if one row of csv file has more columns than 
> maxColumns (default is 20480). The process of parsing was stop.
> 
> Internal state when error was thrown: line=1, column=3, record=0, charIndex=12
> com.univocity.parsers.common.TextParsingException: 
> java.lang.ArrayIndexOutOfBoundsException - 2
> Hint: Number of columns processed may have exceeded limit of 2 columns. Use 
> settings.setMaxColumns(int) to define the maximum number of columns your 
> input can have
> Ensure your configuration is correct, with delimiters, quotes and escape 
> sequences that match the input format you are trying to parse
> Parser Configuration: CsvParserSettings:
> 
> 
> I did some investigation in univocity library but the way it handle is throw 
> error that why spark stop the process.
> 
> How to skip the invalid row and just continue to parse next valid one?
> Any libs can replace univocity in that job?
> 
> Thanks & regards,
> Chanh
> -- 
> Regards,
> Chanh


user-unsubscr...@spark.apache.org

2017-06-07 Thread williamtellme123
user-unsubscr...@spark.apache.org

 

From: kundan kumar [mailto:iitr.kun...@gmail.com] 
Sent: Wednesday, June 7, 2017 5:15 AM
To: 颜发才(Yan Facai) 
Cc: spark users 
Subject: Re: Convert the feature vector to raw data

 

Hi Yan, 

 

This doesnt work.

 

thanks,

kundan

 

On Wed, Jun 7, 2017 at 2:53 PM, 颜发才(Yan Facai)  > wrote:

Hi, kumar.

How about removing the `select` in your code?

namely,

Dataset result = model.transform(testData);

result.show(1000, false);





 

On Wed, Jun 7, 2017 at 5:00 PM, kundan kumar  > wrote:

I am using 

 

Dataset result = model.transform(testData).select("probability", 
"label","features");

 result.show(1000, false);

 

In this case the feature vector is being printed as output. Is there a way that 
my original raw data gets printed instead of the feature vector OR is there a 
way to reverse extract my raw data from the feature vector. All of the features 
that my dataset have is categorical in nature.

 

Thanks,

Kundan

 

 



user-unsubscr...@spark.apache.org

2017-06-07 Thread williamtellme123
user-unsubscr...@spark.apache.org

 

From: 颜发才(Yan Facai) [mailto:facai@gmail.com] 
Sent: Wednesday, June 7, 2017 4:24 AM
To: kundan kumar 
Cc: spark users 
Subject: Re: Convert the feature vector to raw data

 

Hi, kumar.

How about removing the `select` in your code?

namely,

Dataset result = model.transform(testData);

result.show(1000, false);





 

On Wed, Jun 7, 2017 at 5:00 PM, kundan kumar  > wrote:

I am using 

 

Dataset result = model.transform(testData).select("probability", 
"label","features");

 result.show(1000, false);

 

In this case the feature vector is being printed as output. Is there a way that 
my original raw data gets printed instead of the feature vector OR is there a 
way to reverse extract my raw data from the feature vector. All of the features 
that my dataset have is categorical in nature.

 

Thanks,

Kundan

 



user-unsubscr...@spark.apache.org

2017-06-07 Thread williamtellme123
user-unsubscr...@spark.apache.org

user-unsubscr...@spark.apache.org

From: kundan kumar [mailto:iitr.kun...@gmail.com] 
Sent: Wednesday, June 7, 2017 4:01 AM
To: spark users 
Subject: Convert the feature vector to raw data

 

I am using 

 

Dataset result = model.transform(testData).select("probability", 
"label","features");

 result.show(1000, false);

 

In this case the feature vector is being printed as output. Is there a way that 
my original raw data gets printed instead of the feature vector OR is there a 
way to reverse extract my raw data from the feature vector. All of the features 
that my dataset have is categorical in nature.

 

Thanks,

Kundan



[CSV] If number of columns of one row bigger than maxcolumns it stop the whole parsing process.

2017-06-07 Thread Chanh Le
Hi everyone,
I am using Spark 2.1.1 to read csv files and convert to avro files.
One problem that I am facing is that if one row of a CSV file has more columns
than maxColumns (default is 20480), the whole parsing process stops.

Internal state when error was thrown: line=1, column=3, record=0,
charIndex=12
com.univocity.parsers.common.TextParsingException:
java.lang.ArrayIndexOutOfBoundsException - 2
Hint: Number of columns processed may have exceeded limit of 2 columns. Use
settings.setMaxColumns(int) to define the maximum number of columns your
input can have
Ensure your configuration is correct, with delimiters, quotes and escape
sequences that match the input format you are trying to parse
Parser Configuration: CsvParserSettings:


I did some investigation in the univocity library, but the way it handles this
is to throw an error, which is why Spark stops the process.

How can I skip the invalid row and just continue to parse the next valid one?
Are there any libs that can replace univocity for that job?

Thanks & regards,
Chanh
-- 
Regards,
Chanh


Re: Exception which using ReduceByKeyAndWindow in Spark Streaming.

2017-06-07 Thread swetha kasireddy
I changed the data structure to scala.collection.immutable.Set and I still
see the same issue. My key is a String. I do the following in my reduce
and invReduce:

visitorSet1 ++visitorSet2.toTraversable


visitorSet1 --visitorSet2.toTraversable
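
For reference, the overall shape of the call is roughly this (window sizes and
partition count are illustrative):

import org.apache.spark.streaming.Seconds

type Visitors = (Long, Set[String])

def reduceFunc(a: Visitors, b: Visitors): Visitors =
  (math.max(a._1, b._1), a._2 ++ b._2)   // union of the two visitor sets

def invReduceFunc(a: Visitors, b: Visitors): Visitors =
  (math.max(a._1, b._1), a._2 -- b._2)   // drop the visitors leaving the window

def filterFunc(kv: (String, Visitors)): Boolean = kv._2._2.nonEmpty

// pairs: DStream[(String, Visitors)] built upstream
val windowed = pairs.reduceByKeyAndWindow(
  reduceFunc _, invReduceFunc _, Seconds(600), Seconds(60), 4, filterFunc _)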

On Tue, Jun 6, 2017 at 8:22 PM, Tathagata Das 
wrote:

> Yes, and in general any mutable data structure. You have to immutable data
> structures whose hashcode and equals is consistent enough for being put in
> a set.
>
> On Jun 6, 2017 4:50 PM, "swetha kasireddy" 
> wrote:
>
>> Are you suggesting against the usage of HashSet?
>>
>> On Tue, Jun 6, 2017 at 3:36 PM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> This may be because of HashSet is a mutable data structure, and it seems
>>> you are actually mutating it in "set1 ++set2". I suggest creating a new
>>> HashMap in the function (and add both maps into it), rather than mutating
>>> one of them.
>>>
>>> On Tue, Jun 6, 2017 at 11:30 AM, SRK  wrote:
>>>
 Hi,

 I see the following error when I use ReduceByKeyAndWindow in my Spark
 Streaming app. I use reduce, invReduce and filterFunction as shown
 below.
 Any idea as to why I get the error?

  java.lang.Exception: Neither previous window has value for key, nor new
 values found. Are you sure your key class hashes consistently?


   def reduceWithHashSet: ((Long, HashSet[String]), (Long,
 HashSet[String]))
 => (Long, HashSet[String])= {
 case ((timeStamp1: Long, set1: HashSet[String]), (timeStamp2: Long,
 set2: HashSet[String])) => (Math.max(timeStamp1, timeStamp2), set1
 ++set2 )

   }

   def invReduceWithHashSet: ((Long, HashSet[String]), (Long,
 HashSet[String])) => (Long, HashSet[String])= {
 case ((timeStamp1: Long, set1: HashSet[String]), (timeStamp2: Long,
 set2: HashSet[String])) => (Math.max(timeStamp1, timeStamp2),
 set1.diff(set2))
   }

   def filterFuncForInvReduce: ((String, (Long, HashSet[String]))) =>
 (Boolean)= {
 case ((metricName:String, (timeStamp: Long, set: HashSet[String])))
 =>
 set.size>0
   }






 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Exception-which-using-ReduceByKeyAndWi
 ndow-in-Spark-Streaming-tp28748.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org


>>>
>>


Scala, Python or Java for Spark programming

2017-06-07 Thread Mich Talebzadeh
Hi,

I am a fan of Scala and functional programming hence I prefer Scala.

I had a discussion with a hardcore Java programmer and a data scientist who
prefers Python.

Their view is that in collaborative work using Scala it is
almost impossible to understand someone else's Scala code.

Hence I was wondering how much truth there is in this statement. Given that
Spark uses Scala as its core development language, what is the general view
on the use of Scala, Python or Java?

Thanks,

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: problem initiating spark context with pyspark

2017-06-07 Thread Curtis Burkhalter
Thanks Doc, I saw this on another board yesterday, so I've tried it by
first going to the directory where I've stored winutils.exe and then,
as an admin, running the command that you suggested, and I get this
exception when checking the permissions:

C:\winutils\bin>winutils.exe ls -F C:\tmp\hive
FindFileOwnerAndPermission error (1789): The trust relationship between
this workstation and the primary domain failed.

I'm fairly new to the command line and to working out what the different
exceptions mean. Do you have any advice on what this error means and how I
might go about fixing it?

Thanks again


On Wed, Jun 7, 2017 at 9:51 AM, Doc Dwarf  wrote:

> Hi Curtis,
>
> I believe in windows, the following command needs to be executed: (will
> need winutils installed)
>
> D:\winutils\bin\winutils.exe chmod 777 D:\tmp\hive
>
>
>
> On 6 June 2017 at 09:45, Curtis Burkhalter 
> wrote:
>
>> Hello all,
>>
>> I'm new to Spark and I'm trying to interact with it using Pyspark. I'm
>> using the prebuilt version of spark v. 2.1.1 and when I go to the command
>> line and use the command 'bin\pyspark' I have initialization problems and
>> get the following message:
>>
>> C:\spark\spark-2.1.1-bin-hadoop2.7> bin\pyspark
>> Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 11:57:41)
>> [MSC v.1900 64 bit (AMD64)] on win32
>> Type "help", "copyright", "credits" or "license" for more information.
>> Using Spark's default log4j profile: org/apache/spark/log4j-default
>> s.properties
>> Setting default log level to "WARN".
>> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
>> setLogLevel(newLevel).
>> 17/06/06 10:30:14 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> 17/06/06 10:30:21 WARN ObjectStore: Version information not found in
>> metastore. hive.metastore.schema.verification is not enabled so
>> recording the schema version 1.2.0
>> 17/06/06 10:30:21 WARN ObjectStore: Failed to get database default,
>> returning NoSuchObjectException
>> Traceback (most recent call last):
>>   File "C:\spark\spark-2.1.1-bin-hadoop2.7\python\pyspark\sql\utils.py",
>> line 63, in deco
>> return f(*a, **kw)
>>   File 
>> "C:\spark\spark-2.1.1-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py",
>> line 319, in get_return_value
>> py4j.protocol.Py4JJavaError: An error occurred while calling
>> o22.sessionState.
>> : java.lang.IllegalArgumentException: Error while instantiating
>> 'org.apache.spark.sql.hive.HiveSessionState':
>> at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$
>> SparkSession$$reflect(SparkSession.scala:981)
>> at org.apache.spark.sql.SparkSession.sessionState$lzycompute(
>> SparkSession.scala:110)
>> at org.apache.spark.sql.SparkSession.sessionState(SparkSession.
>> scala:109)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
>> ssorImpl.java:62)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
>> thodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)
>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.
>> java:357)
>> at py4j.Gateway.invoke(Gateway.java:280)
>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.j
>> ava:132)
>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>> at py4j.GatewayConnection.run(GatewayConnection.java:214)
>> at java.lang.Thread.run(Thread.java:748)
>> Caused by: java.lang.reflect.InvocationTargetException
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> Method)
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance(Native
>> ConstructorAccessorImpl.java:62)
>> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(De
>> legatingConstructorAccessorImpl.java:45)
>> at java.lang.reflect.Constructor.newInstance(Constructor.java:4
>> 23)
>> at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$
>> SparkSession$$reflect(SparkSession.scala:978)
>> ... 13 more
>> Caused by: java.lang.IllegalArgumentException: Error while instantiating
>> 'org.apache.spark.sql.hive.HiveExternalCatalog':
>> at org.apache.spark.sql.internal.SharedState$.org$apache$spark$
>> sql$internal$SharedState$$reflect(SharedState.scala:169)
>> at org.apache.spark.sql.internal.SharedState.(SharedState
>> .scala:86)
>> at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.
>> apply(SparkSession.scala:101)
>> at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.
>> apply(SparkSession.scala:101)
>> at scala.Option.getOrElse(Option.scala:121)
>> at 

Re: problem initiating spark context with pyspark

2017-06-07 Thread Doc Dwarf
Hi Curtis,

I believe that on Windows the following command needs to be executed (it
requires winutils to be installed):

D:\winutils\bin\winutils.exe chmod 777 D:\tmp\hive



On 6 June 2017 at 09:45, Curtis Burkhalter 
wrote:

> Hello all,
>
> I'm new to Spark and I'm trying to interact with it using Pyspark. I'm
> using the prebuilt version of spark v. 2.1.1 and when I go to the command
> line and use the command 'bin\pyspark' I have initialization problems and
> get the following message:
>
> C:\spark\spark-2.1.1-bin-hadoop2.7> bin\pyspark
> Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 11:57:41)
> [MSC v.1900 64 bit (AMD64)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: org/apache/spark/log4j-
> defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
> setLogLevel(newLevel).
> 17/06/06 10:30:14 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 17/06/06 10:30:21 WARN ObjectStore: Version information not found in
> metastore. hive.metastore.schema.verification is not enabled so recording
> the schema version 1.2.0
> 17/06/06 10:30:21 WARN ObjectStore: Failed to get database default,
> returning NoSuchObjectException
> Traceback (most recent call last):
>   File "C:\spark\spark-2.1.1-bin-hadoop2.7\python\pyspark\sql\utils.py",
> line 63, in deco
> return f(*a, **kw)
>   File "C:\spark\spark-2.1.1-bin-hadoop2.7\python\lib\py4j-0.
> 10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling
> o22.sessionState.
> : java.lang.IllegalArgumentException: Error while instantiating
> 'org.apache.spark.sql.hive.HiveSessionState':
> at org.apache.spark.sql.SparkSession$.org$apache$
> spark$sql$SparkSession$$reflect(SparkSession.scala:981)
> at org.apache.spark.sql.SparkSession.sessionState$
> lzycompute(SparkSession.scala:110)
> at org.apache.spark.sql.SparkSession.sessionState(
> SparkSession.scala:109)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(
> ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.
> java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.spark.sql.SparkSession$.org$apache$
> spark$sql$SparkSession$$reflect(SparkSession.scala:978)
> ... 13 more
> Caused by: java.lang.IllegalArgumentException: Error while instantiating
> 'org.apache.spark.sql.hive.HiveExternalCatalog':
> at org.apache.spark.sql.internal.SharedState$.org$apache$spark$
> sql$internal$SharedState$$reflect(SharedState.scala:169)
> at org.apache.spark.sql.internal.SharedState.(
> SharedState.scala:86)
> at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(
> SparkSession.scala:101)
> at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(
> SparkSession.scala:101)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.SparkSession.sharedState$
> lzycompute(SparkSession.scala:101)
> at org.apache.spark.sql.SparkSession.sharedState(
> SparkSession.scala:100)
> at org.apache.spark.sql.internal.SessionState.(
> SessionState.scala:157)
> at org.apache.spark.sql.hive.HiveSessionState.(
> HiveSessionState.scala:32)
> ... 18 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

Re: [Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-07 Thread Patrik Medvedev
No, I don't.

Wed, 7 Jun 2017 at 16:42, Jean Georges Perrin :

> Do you have some other security in place like Kerberos or impersonation?
> It may affect your access.
>
>
> jg
>
>
> On Jun 7, 2017, at 02:15, Patrik Medvedev 
> wrote:
>
> Hello guys,
>
> I need to execute hive queries on remote hive server from spark, but for
> some reasons i receive only column names(without data).
> Data available in table, i checked it via HUE and java jdbc connection.
>
> Here is my code example:
> val test = spark.read
> .option("url", "jdbc:hive2://remote.hive.server:1/work_base")
> .option("user", "user")
> .option("password", "password")
> .option("dbtable", "some_table_with_data")
> .option("driver", "org.apache.hive.jdbc.HiveDriver")
> .format("jdbc")
> .load()
> test.show()
>
>
> Scala version: 2.11
> Spark version: 2.1.0, i also tried 2.1.1
> Hive version: CDH 5.7 Hive 1.1.1
> Hive JDBC version: 1.1.1
>
> But this problem available on Hive with later versions, too.
> Could you help me with this issue, because i didn't find anything in mail
> group answers and StackOverflow.
> Or could you help me find correct solution how to query remote hive from
> spark?
>
> --
> *Cheers,*
> *Patrick*
>
>


Re: [Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-07 Thread Jean Georges Perrin
Do you have some other security in place like Kerberos or impersonation? It may 
affect your access. 

jg


> On Jun 7, 2017, at 02:15, Patrik Medvedev  wrote:
> 
> Hello guys,
> 
> I need to execute hive queries on remote hive server from spark, but for some 
> reasons i receive only column names(without data).
> Data available in table, i checked it via HUE and java jdbc connection.
> 
> Here is my code example:
> val test = spark.read
> .option("url", "jdbc:hive2://remote.hive.server:1/work_base")
> .option("user", "user")
> .option("password", "password")
> .option("dbtable", "some_table_with_data")
> .option("driver", "org.apache.hive.jdbc.HiveDriver")
> .format("jdbc")
> .load()
> test.show()
> 
> 
> Scala version: 2.11
> Spark version: 2.1.0, i also tried 2.1.1
> Hive version: CDH 5.7 Hive 1.1.1
> Hive JDBC version: 1.1.1
> 
> But this problem available on Hive with later versions, too.
> Could you help me with this issue, because i didn't find anything in mail 
> group answers and StackOverflow.
> Or could you help me find correct solution how to query remote hive from 
> spark?
> 
> -- 
> Cheers,
> Patrick


Re: Convert the feature vector to raw data

2017-06-07 Thread kundan kumar
Hi Yan,

This doesn't work.

thanks,
kundan

On Wed, Jun 7, 2017 at 2:53 PM, 颜发才(Yan Facai)  wrote:

> Hi, kumar.
>
> How about removing the `select` in your code?
> namely,
>
> Dataset result = model.transform(testData);
> result.show(1000, false);
>
>
>
>
> On Wed, Jun 7, 2017 at 5:00 PM, kundan kumar 
> wrote:
>
>> I am using
>>
>> Dataset result = model.transform(testData).select("probability",
>> "label","features");
>>  result.show(1000, false);
>>
>> In this case the feature vector is being printed as output. Is there a
>> way that my original raw data gets printed instead of the feature vector OR
>> is there a way to reverse extract my raw data from the feature vector. All
>> of the features that my dataset have is categorical in nature.
>>
>> Thanks,
>> Kundan
>>
>
>


Re: Convert the feature vector to raw data

2017-06-07 Thread Yan Facai
Hi, kumar.

How about removing the `select` in your code?
namely,

Dataset result = model.transform(testData);
result.show(1000, false);




On Wed, Jun 7, 2017 at 5:00 PM, kundan kumar  wrote:

> I am using
>
> Dataset result = model.transform(testData).select("probability",
> "label","features");
>  result.show(1000, false);
>
> In this case the feature vector is being printed as output. Is there a way
> that my original raw data gets printed instead of the feature vector OR is
> there a way to reverse extract my raw data from the feature vector. All of
> the features that my dataset have is categorical in nature.
>
> Thanks,
> Kundan
>


[Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-07 Thread Patrik Medvedev
Hello guys,

I need to execute Hive queries on a remote Hive server from Spark, but for
some reason I receive only the column names (without data).
The data is available in the table; I checked it via HUE and a plain Java JDBC connection.

Here is my code example:
val test = spark.read
.option("url", "jdbc:hive2://remote.hive.server:1/work_base")
.option("user", "user")
.option("password", "password")
.option("dbtable", "some_table_with_data")
.option("driver", "org.apache.hive.jdbc.HiveDriver")
.format("jdbc")
.load()
test.show()


Scala version: 2.11
Spark version: 2.1.0, i also tried 2.1.1
Hive version: CDH 5.7 Hive 1.1.1
Hive JDBC version: 1.1.1

But this problem occurs with later Hive versions, too.
Could you help me with this issue? I didn't find anything in the mailing list
archives or on StackOverflow.
Or could you help me find the correct way to query a remote Hive server from
Spark?

-- 
*Cheers,*
*Patrick*


[no subject]

2017-06-07 Thread Patrik Medvedev
Hello guys,

I need to execute Hive queries on a remote Hive server from Spark, but for
some reason I receive only the column names (without data).
The data is available in the table; I checked it via HUE and a plain Java JDBC connection.

Here is my code example:
val test = spark.read
.option("url", "jdbc:hive2://remote.hive.server:1/work_base")
.option("user", "user")
.option("password", "password")
.option("dbtable", "some_table_with_data")
.option("driver", "org.apache.hive.jdbc.HiveDriver")
.format("jdbc")
.load()
test.show()


Scala version: 2.11
Spark version: 2.1.0, i also tried 2.1.1
Hive version: CDH 5.7 Hive 1.1.1
Hive JDBC version: 1.1.1

But this problem occurs with later Hive versions, too.
Could you help me with this issue? I didn't find anything in the mailing list
archives or on StackOverflow.

-- 
*Cheers,*
*Patrick*


Convert the feature vector to raw data

2017-06-07 Thread kundan kumar
I am using

Dataset result = model.transform(testData).select("probability",
"label","features");
 result.show(1000, false);

In this case the feature vector is being printed as output. Is there a way to
print my original raw data instead of the feature vector, OR is there a way to
reverse-extract my raw data from the feature vector? All of the features in my
dataset are categorical in nature.

Thanks,
Kundan


Re: Java SPI jar reload in Spark

2017-06-07 Thread Jonnas Li(Contractor)
To be more explicit, I used mapWithState() in my application, just like this:

stream = KafkaUtils.createStream(..)
mappedStream = stream.mapPartitionsToPair(..)
stateStream = mappedStream.mapWithState(MyUpdateFunc(..))
stateStream.foreachRDD(..)

I call the jar in MyUpdateFunc(), and the jar reloading is triggered by some 
other event.

I'm not sure if this approach is feasible. To my understanding, Spark will 
checkpoint the state, so the application can’t be updated at runtime; that’s 
why I got the exception.

Any suggestion is welcome, as is any other idea for doing something like 
this; I just want to provide an approach that lets users customize their 
business logic.

Regards
李忠双 / Jonnas

From: Zhongshuang Li 
>
Date: Tuesday, 6 June 2017, 6:30 PM
To: Alonso Isidoro Roman >
Cc: Jörn Franke >, 
"user@spark.apache.org" 
>
Subject: Re: Java SPI jar reload in Spark

I used 
java.util.ServiceLoader
  , as the javadoc says it supports reloading.
Please point it out if I'm mis-understanding.

Regards
Jonnas Li

From: Alonso Isidoro Roman >
Date: Tuesday, 6 June 2017, 6:21 PM
To: Zhongshuang Li 
>
Cc: Jörn Franke >, 
"user@spark.apache.org" 
>
Subject: Re: Java SPI jar reload in Spark

Hi, a quick search on google.

https://github.com/spark-jobserver/spark-jobserver/issues/130


Alonso Isidoro Roman
about.me/alonso.isidoro.roman


2017-06-06 12:14 GMT+02:00 Jonnas Li(Contractor) 
>:
Thank for your quick response.
These jars are used to define some customize business logic, and they can be 
treat as plug-ins in our business scenario. And the jars are developed/maintain 
by some third-party partners, this means there will be some version updating.
My expectation is update the business code with restarting the spark streaming 
job, any suggestion will be very grateful.

Regards
Jonnas Li

From: Jörn Franke >
Date: Tuesday, 6 June 2017, 5:55 PM
To: Zhongshuang Li 
>
Cc: "user@spark.apache.org" 
>
Subject: Re: Java SPI jar reload in Spark

Why do you need jar reloading? What functionality is executed during jar 
reloading. Maybe there is another way to achieve the same without jar 
reloading. In fact, it might be dangerous from a functional point of view- 
functionality in jar changed and all your computation is wrong.

On 6. Jun 2017, at 11:35, Jonnas Li(Contractor) 
> wrote:

I have a Spark Streaming application, which dynamically calling a jar (Java 
SPI), and the jar is called in a mapWithState() function, it was working fine 
for a long time.
Recently, I got a requirement which required to reload the jar during runtime.
But when the reloading is completed, the spark streaming job got failed, and I 
get the following exception, it seems the spark try to deserialize the 
checkpoint failed.
My question is whether the logic in the jar will be serialized into checkpoint, 
and is it possible to do the jar reloading during runtime in Spark Streaming?



[2017-06-06 17:13:12,185] WARN Lost task 1.0 in stage 5355.0 (TID 4817, 
ip-10-21-14-205.envisioncn.com): 
java.lang.ClassCastException: cannot assign instance of 
scala.collection.immutable.List$SerializationProxy to field 
org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_
 of type scala.collection.Seq in instance of 
org.apache.spark.streaming.rdd.MapWithStateRDD
at 
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
at 
java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at 

Re: Edge Node in Spark

2017-06-07 Thread Mich Talebzadeh
Agreed with Ayan.

Essentially an Edge node is a physical host or VM that is used by the
application to run the job. The users or service users start the process
from the Edge node. Edge nodes are added to the cluster for each
environment, for example DEV/TEST/UAT, etc.

The Edge node normally has all the compatible binaries - in this case a
$KAFKA_HOME installation - needed to run Spark jobs. These binaries are
normally installed independently (open source) or through parcels, say with
CDH, when the cluster admin adds the node to the cluster.

I have not seen anyone allowed to log in directly to datanodes, namenodes
or Kafka nodes (if they are on different hardware than the datanodes, etc.)
to kick off spark-submit; it is all done through edge or network servers. For
example, you may have Spark processes running on 8 datanodes, say in
standalone mode with one master and multiple worker processes, say two on each
datanode. However, no worker process will be running on the edge node.
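
For example, a job kicked off from the edge node typically looks something like
this (class name, memory and paths are illustrative):

spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode client \
  --class com.example.MyApp \
  --executor-memory 4G \
  /home/appuser/myapp.jar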

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 6 June 2017 at 22:42, ayan guha  wrote:

> They are all same thing. Essentially it means a machine which is not part
> of the cluster butHas all clients.
>
> On Wed, 7 Jun 2017 at 5:48 am, Irving Duran 
> wrote:
>
>> Where in the documentation did you find "edge node"? Spark would call it
>> worker or executor, but not "edge node".  Her is some info about yarn logs
>> -> https://spark.apache.org/docs/latest/running-on-yarn.html.
>>
>>
>> Thank You,
>>
>> Irving Duran
>>
>> On Tue, Jun 6, 2017 at 11:48 AM, Ashok Kumar 
>> wrote:
>>
>>> Just Straight Spark please.
>>>
>>> Also if I run a spark job using Python or Scala using Yarn where the log
>>> files are kept in the edge node?  Are these under logs directory for yarn?
>>>
>>> thanks
>>>
>>>
>>> On Tuesday, 6 June 2017, 14:11, Irving Duran 
>>> wrote:
>>>
>>>
>>> Ashok,
>>> Are you working with straight spark or referring to GraphX?
>>>
>>>
>>> Thank You,
>>>
>>> Irving Duran
>>>
>>> On Mon, Jun 5, 2017 at 3:45 PM, Ashok Kumar <
>>> ashok34...@yahoo.com.invalid> wrote:
>>>
>>> Hi,
>>>
>>> I am a bit confused between Edge node, Edge server and gateway node in
>>> Spark.
>>>
>>> Do these mean the same thing?
>>>
>>> How does one set up an Edge node to be used in Spark? Is this different
>>> from Edge node for Hadoop please?
>>>
>>> Thanks
>>>
>>> -- -- -
>>> To unsubscribe e-mail: user-unsubscribe@spark.apache. org
>>> 
>>>
>>>
>>>
>>>
>>>
>> --
> Best Regards,
> Ayan Guha
>