SPARK-19547

2017-06-07 Thread Rastogi, Pankaj
Hi, I have been trying to distribute Kafka topics among different instances of the same consumer group. I am using the KafkaDirectStream API for creating DStreams. After the second consumer instance comes up, Kafka does a partition rebalance and then the Spark driver of the first consumer dies with the
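
A minimal sketch of creating a direct stream with an explicit consumer group, assuming the spark-streaming-kafka-0-10 integration and an existing StreamingContext ssc; the broker address, topic and group id are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-consumer-group",            // the group being rebalanced
      "enable.auto.commit" -> (false: java.lang.Boolean))

    // Each instance started with the same group.id receives a share of the partitions.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("myTopic"), kafkaParams))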

Read Data From NFS

2017-06-07 Thread ayan guha
Hi guys, quick one: how does Spark deal with (i.e. create partitions for) large files sitting on NFS, assuming all executors can see the file in exactly the same way? I.e. when I run r = sc.textFile("file://my/file"), what happens if the file is on NFS? Is there any difference from r =
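
A minimal sketch of reading a file from a shared mount, assuming the NFS path is mounted identically on the driver and every executor; the path and partition count are placeholders. Spark splits local ("file://") input by size through Hadoop's FileInputFormat, just as it does for HDFS, so minPartitions controls the parallelism:

    val r = sc.textFile("file:///mnt/nfs/data/my_file.txt", minPartitions = 16)
    println(r.getNumPartitions)   // number of splits actually created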

Re: [CSV] If number of columns of one row bigger than maxcolumns it stop the whole parsing process.

2017-06-07 Thread Jörn Franke
The CSV data source allows you to skip invalid lines - this should also include lines that have more than maxColumns. Choose mode "DROPMALFORMED" > On 8. Jun 2017, at 03:04, Chanh Le wrote: > > Hi Takeshi, Jörn Franke, > > The problem is even I increase the maxColumns it
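
A minimal sketch of the suggested option, assuming Spark 2.x CSV reading; the path and header setting are placeholders:

    val df = spark.read
      .option("header", "true")
      .option("mode", "DROPMALFORMED")   // drop rows that fail to parse instead of failing the job
      .csv("/path/to/input.csv")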

Re: No TypeTag Available for String

2017-06-07 Thread Ryan
did you include the proper scala-reflect dependency? On Wed, May 31, 2017 at 1:01 AM, krishmah wrote: > I am currently using Spark 2.0.1 with Scala 2.11.8. However same code works > with Scala 2.10.6. Please advise if I am missing something > > import
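
A minimal sketch of the dependency in question, assuming sbt as the build tool; the version should match the compiler version (2.11.8 in this thread):

    // build.sbt
    libraryDependencies += "org.scala-lang" % "scala-reflect" % "2.11.8"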

Re: Worker node log not showed

2017-06-07 Thread Ryan
I think you need to get the logger within the lambda; otherwise it's the driver-side logger, which won't work. On Wed, May 31, 2017 at 4:48 PM, Paolo Patierno wrote: > No it's running in standalone mode as Docker image on Kubernetes. > > > The only way I found was to
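
A minimal sketch of obtaining the logger inside the closure so it is created on the executor rather than serialized from the driver; slf4j is an assumption about the logging backend and rdd is a placeholder:

    rdd.foreachPartition { iter =>
      val log = org.slf4j.LoggerFactory.getLogger("worker-side")   // created on the executor
      iter.foreach(record => log.info(s"processing $record"))
    }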

Re: good http sync client to be used with spark

2017-06-07 Thread Ryan
We use AsyncHttpClient (from the Java world) and simply call future.get as a synchronous call. On Thu, Jun 1, 2017 at 4:08 AM, vimal dinakaran wrote: > Hi, > In our application pipeline we need to push the data from spark streaming > to a http server. > > I would like to have
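
A minimal sketch of blocking on the AsyncHttpClient future to get synchronous behaviour, assuming the org.asynchttpclient 2.x artifact; the URL and payload are placeholders:

    import org.asynchttpclient.DefaultAsyncHttpClient

    val client = new DefaultAsyncHttpClient()
    val response = client
      .preparePost("http://example.com/ingest")
      .setBody("""{"key":"value"}""")
      .execute()
      .get()                               // block until the response arrives
    println(response.getStatusCode)
    client.close()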

Re: Question about mllib.recommendation.ALS

2017-06-07 Thread Ryan
1. Could you give the job, stage & task status from the Spark UI? I found it extremely useful for performance tuning. 2. Use model.transform for predictions. Usually we have a pipeline for preparing training data, and using the same pipeline to transform the data you want to predict could give us the
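
A minimal sketch of the DataFrame-based ALS flow where predictions come from model.transform; the column names and the training/test DataFrames are placeholders:

    import org.apache.spark.ml.recommendation.ALS

    val als = new ALS()
      .setUserCol("userId")
      .setItemCol("itemId")
      .setRatingCol("rating")

    val model = als.fit(training)             // training: DataFrame with the three columns above
    val predictions = model.transform(test)   // adds a "prediction" column to test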

Re: Java SPI jar reload in Spark

2017-06-07 Thread Ryan
I'd suggest scripts like js, groovy, etc.. To my understanding the service loader mechanism isn't a good fit for runtime reloading. On Wed, Jun 7, 2017 at 4:55 PM, Jonnas Li(Contractor) < zhongshuang...@envisioncn.com> wrote: > To be more explicit, I used mapwithState() in my application, just

Re: Convert the feature vector to raw data

2017-06-07 Thread Ryan
If you use StringIndexer to categorize the data, IndexToString can convert it back. On Wed, Jun 7, 2017 at 6:14 PM, kundan kumar wrote: > Hi Yan, > > This doesn't work. > > thanks, > kundan > > On Wed, Jun 7, 2017 at 2:53 PM, 颜发才(Yan Facai) > wrote: >
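
A minimal sketch of the round trip, assuming spark.ml feature transformers and placeholder column names:

    import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
    val indexed = indexer.transform(df)

    // Reverse the mapping using the labels learned by the indexer.
    val converter = new IndexToString()
      .setInputCol("categoryIndex")
      .setOutputCol("originalCategory")
      .setLabels(indexer.labels)
    val restored = converter.transform(indexed)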

Re: [CSV] If number of columns of one row bigger than maxcolumns it stop the whole parsing process.

2017-06-07 Thread Chanh Le
Hi Takeshi, Jörn Franke, The problem is that even if I increase maxColumns, some lines still have more columns than the value I set, and that costs a lot of memory. So I just want to skip any line that has more columns than the maxColumns I set. Regards, Chanh On Thu, Jun 8, 2017 at 12:48 AM

Re: Scala, Python or Java for Spark programming

2017-06-07 Thread Matt Tenenbaum
A lot depends on your context as well. If I'm using Spark _for analysis_, I frequently use python; it's a starting point, from which I can then leverage pandas, matplotlib/seaborn, and other powerful tools available on top of python. If the Spark outputs are the ends themselves, rather than the

Re: Scala, Python or Java for Spark programming

2017-06-07 Thread Bryan Jeffrey
Mich, We use Scala for a large project. On our team we've set a few standards to ensure readability (we try to avoid excessive use of tuples, use named functions, etc.) Given these constraints, I find Scala to be very readable, and far easier to use than Java. The Lambda functionality of Java

Re: [CSV] If number of columns of one row bigger than maxcolumns it stop the whole parsing process.

2017-06-07 Thread Takeshi Yamamuro
Is it not enough to set `maxColumns` in CSV options? https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116 // maropu On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke wrote: > Spark CSV data
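
A minimal sketch of raising the limit through that option; the value and path are placeholders:

    val df = spark.read
      .option("maxColumns", "30000")   // passed through to the underlying univocity parser
      .csv("/path/to/wide.csv")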

Re: Scala, Python or Java for Spark programming

2017-06-07 Thread Jörn Franke
I think this is a religious question ;-) Java is often underestimated, because people are not aware of its lambda functionality, which makes the code very readable. Scala - it depends on who programs it. People coming from a normal Java background write Java-like code in Scala, which might not be

Re: [CSV] If number of columns of one row bigger than maxcolumns it stop the whole parsing process.

2017-06-07 Thread Jörn Franke
Spark CSV data source should be able > On 7. Jun 2017, at 17:50, Chanh Le wrote: > > Hi everyone, > I am using Spark 2.1.1 to read csv files and convert to avro files. > One problem that I am facing is if one row of csv file has more columns than > maxColumns (default is

[CSV] If number of columns of one row bigger than maxcolumns it stop the whole parsing process.

2017-06-07 Thread Chanh Le
Hi everyone, I am using Spark 2.1.1 to read CSV files and convert them to Avro files. One problem that I am facing is that if one row of the CSV file has more columns than maxColumns (default is 20480), the parsing process stops. Internal state when the error was thrown: line=1, column=3, record=0,

Re: Exception which using ReduceByKeyAndWindow in Spark Streaming.

2017-06-07 Thread swetha kasireddy
I changed the data structure to scala.collection.immutable.Set and I still see the same issue. My key is a String. I do the following in my reduce and invReduce: visitorSet1 ++ visitorSet2.toTraversable and visitorSet1 -- visitorSet2.toTraversable. On Tue, Jun 6, 2017 at 8:22 PM, Tathagata Das
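
A minimal sketch of a reduce / inverse-reduce pair over a window, assuming pairs is a DStream[(String, Set[String])] of visitor ids keyed by id and that checkpointing is enabled (required for the inverse-reduce form); the durations are placeholders:

    import org.apache.spark.streaming.Seconds

    val windowed = pairs.reduceByKeyAndWindow(
      (a: Set[String], b: Set[String]) => a ++ b,   // add the sets from batches entering the window
      (a: Set[String], b: Set[String]) => a -- b,   // remove the sets from batches leaving the window
      Seconds(300),                                 // window length
      Seconds(60))                                  // slide interval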

Scala, Python or Java for Spark programming

2017-06-07 Thread Mich Talebzadeh
Hi, I am a fan of Scala and functional programming, hence I prefer Scala. I had a discussion with a hardcore Java programmer and a data scientist who prefers Python. Their view is that in collaborative work using Scala, it is almost impossible to understand someone else's Scala

Re: problem initiating spark context with pyspark

2017-06-07 Thread Curtis Burkhalter
Thanks Doc, I saw this on another board yesterday, so I've tried this by first going to the directory where I've stored winutils.exe and then, as an admin, running the command that you suggested, and I get this exception when checking the permissions: C:\winutils\bin>winutils.exe ls -F

Re: problem initiating spark context with pyspark

2017-06-07 Thread Doc Dwarf
Hi Curtis, I believe on Windows the following command needs to be executed (you will need winutils installed): D:\winutils\bin\winutils.exe chmod 777 D:\tmp\hive On 6 June 2017 at 09:45, Curtis Burkhalter wrote: > Hello all, > > I'm new to Spark and I'm trying to

Re: [Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-07 Thread Patrik Medvedev
No, I don't. Wed, 7 Jun 2017 at 16:42, Jean Georges Perrin : > Do you have some other security in place like Kerberos or impersonation? > It may affect your access. > > > jg > > > On Jun 7, 2017, at 02:15, Patrik Medvedev > wrote: > > Hello guys, > > I

Re: [Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-07 Thread Jean Georges Perrin
Do you have some other security in place like Kerberos or impersonation? It may affect your access. jg > On Jun 7, 2017, at 02:15, Patrik Medvedev wrote: > > Hello guys, > > I need to execute hive queries on remote hive server from spark, but for some > reasons

Re: Convert the feature vector to raw data

2017-06-07 Thread kundan kumar
Hi Yan, This doesn't work. thanks, kundan On Wed, Jun 7, 2017 at 2:53 PM, 颜发才(Yan Facai) wrote: > Hi, kumar. > > How about removing the `select` in your code? > namely, > > Dataset result = model.transform(testData); > result.show(1000, false); > > > > > On Wed, Jun 7,

Re: Convert the feature vector to raw data

2017-06-07 Thread Yan Facai
Hi, kumar. How about removing the `select` in your code? namely, Dataset result = model.transform(testData); result.show(1000, false); On Wed, Jun 7, 2017 at 5:00 PM, kundan kumar wrote: > I am using > > Dataset result =

[Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-07 Thread Patrik Medvedev
Hello guys, I need to execute Hive queries on a remote Hive server from Spark, but for some reason I receive only column names (without data). The data is available in the table; I checked it via Hue and a Java JDBC connection. Here is my code example: val test = spark.read .option("url",
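
A minimal sketch of a JDBC read against HiveServer2; the URL, table name and driver class are assumptions/placeholders, and the HiveServer2 JDBC jars are assumed to be on the classpath:

    val test = spark.read
      .format("jdbc")
      .option("url", "jdbc:hive2://hive-host:10000/default")
      .option("dbtable", "my_table")
      .option("driver", "org.apache.hive.jdbc.HiveDriver")
      .load()
    test.show()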

Convert the feature vector to raw data

2017-06-07 Thread kundan kumar
I am using Dataset result = model.transform(testData).select("probability", "label","features"); result.show(1000, false); In this case the feature vector is being printed as output. Is there a way that my original raw data gets printed instead of the feature vector OR is there a way to reverse

Re: Java SPI jar reload in Spark

2017-06-07 Thread Jonnas Li(Contractor)
To be more explicit, I used mapWithState() in my application, just like this: stream = KafkaUtils.createStream(..) mappedStream = stream.mapPartitionToPair(..) stateStream = mappedStream.mapWithState(MyUpdateFunc(..)) stateStream.foreachRDD(..) I call the jar in MyUpdateFunc(), and the jar
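
A minimal sketch of the mapWithState step of such a pipeline in Scala, assuming mappedStream is a pair DStream of (String, Int), checkpointing is enabled, and the update logic is a placeholder:

    import org.apache.spark.streaming.{State, StateSpec}

    def updateFunc(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
      val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)          // keep the running total per key
      (key, sum)
    }

    val stateStream = mappedStream.mapWithState(StateSpec.function(updateFunc _))
    stateStream.foreachRDD(rdd => rdd.take(10).foreach(println))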

Re: Edge Node in Spark

2017-06-07 Thread Mich Talebzadeh
Agreed with Ayan. Essentially an edge node is a physical host or VM that is used by the application to run the job. The users or service users start the process from the edge node. Edge nodes are added to the cluster per environment, for example DEV/TEST/UAT etc. An edge node normally has all compatible binaries in