RE: pyspark in intellij

2017-02-25 Thread Sidney Feiner
Yes, I got it working once but I can't exactly remember how. I think what I did was the following:
· To the environment variables, add a variable named PYTHONPATH with the path to your pyspark python directory (in my case, C:\spark-2.1.0-bin-hadoop2.7\python)
· To the
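A rough Unix-shell equivalent of the environment-variable step above (the install path is the one from the message; the py4j zip name varies by Spark version, so both are assumptions for your machine):

```shell
# Put the PySpark sources and the bundled py4j on PYTHONPATH so an IDE
# interpreter can import pyspark. Paths are assumptions; adjust to your install.
SPARK_HOME=/opt/spark-2.1.0-bin-hadoop2.7
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH"
echo "$PYTHONPATH"
```

In IntelliJ the same two entries would go into the run configuration's environment variables rather than the shell.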

In Spark streaming, will saved kafka offsets become invalid if I change the number of partitions in a kafka topic?

2017-02-25 Thread shyla deshpande
I am committing offsets to Kafka after my output has been stored, using the commitAsync API. My question is: if I increase/decrease the number of Kafka partitions, will the saved offsets become invalid? Thanks

Spark test error in ProactiveClosureSerializationSuite.scala

2017-02-25 Thread ??????????
hello all, I am building Spark 1.6.2 and I hit a problem when running mvn test. The command is mvn -e -Pyarn -Phive -Phive-thriftserver -DwildcardSuites=org.apache.spark.serializer.ProactiveClosureSerializationSuite test and the test error is ProactiveClosureSerializationSuite: - throws

Spark SQL table authority control?

2017-02-25 Thread 李斌松
Through a JDBC connection to the Spark Thrift Server, I execute Hive SQL. In Hive on Spark, table read/write permissions can be checked by extending a hook, so permissions can be controlled there. What is the corresponding extension point for Spark on Hive?

pyspark in intellij

2017-02-25 Thread Stephen Boesch
Anyone have this working - either in 1.X or 2.X? thanks

Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Marco Mistroni
Try to use --packages to include the jars. From the error it seems it's looking for a main class in the jars, but you are running a python script... On 25 Feb 2017 10:36 pm, "Raymond Xie" wrote: That's right Anahita, however, the class name is not indicated in the original github
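The --packages suggestion would look roughly like this for the Python Kafka word-count example from the thread; the package coordinate for Spark 1.6.x and the ZooKeeper host/topic arguments are assumptions, so treat it as a sketch rather than a verified command:

```shell
# Build the submit command using --packages (resolved from Maven) instead of
# pointing --jars at a local assembly. A Python script needs no --class flag.
# Coordinates and arguments are assumptions; the command is only echoed here.
CMD="spark-submit \
  --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.2 \
  examples/src/main/python/streaming/kafka_wordcount.py \
  localhost:2181 test"
echo "$CMD"
```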

Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Raymond Xie
Thank you very much Marco. I am a beginner in this area - is it possible for you to show me what you think the right script should be to get it executed in the terminal? Sincerely yours, Raymond On Sat, Feb 25, 2017 at 6:00 PM, Marco Mistroni

Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Raymond Xie
That's right Anahita, however, the class name is not indicated in the original github project so I don't know what class should be used here. The github only says: and then run the example `$ bin/spark-submit --jars \ external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar \

Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Anahita Talebi
You're welcome. You need to specify the class. I meant like that: spark-submit /usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar --class "give the name of the class" On Saturday, February 25, 2017, Raymond Xie wrote: >

Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Raymond Xie
Thank you, it is still not working. By the way, here is the original source: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/kafka_wordcount.py Sincerely yours, Raymond On Sat, Feb

Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Anahita Talebi
Hi, I think if you remove --jars, it will work. Like: spark-submit /usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar I had the same problem before and solved it by removing --jars. Cheers, Anahita On Saturday, February 25, 2017, Raymond Xie

Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread yohann jardin
You should read (again?) the Spark documentation about submitting an application: http://spark.apache.org/docs/latest/submitting-applications.html Try with the Pi computation example available with Spark. For example: ./bin/spark-submit --class org.apache.spark.examples.SparkPi
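The SparkPi suggestion above, spelled out (the SPARK_HOME path matches the HDP sandbox layout seen in the thread, and the examples jar name is an assumption; the command is only echoed, not run):

```shell
# Minimal --class submit sketch for the bundled SparkPi example.
# SPARK_HOME and the jar name are assumptions; verify against your install.
SPARK_HOME=/usr/hdp/2.5.0.0-1245/spark
CMD="$SPARK_HOME/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[2] \
  $SPARK_HOME/lib/spark-examples.jar 10"
echo "$CMD"
```

Note that --class names the JVM entry point inside a jar; it is not needed when the application is a Python script.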

No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Raymond Xie
I am doing Spark streaming on a Hortonworks sandbox and am stuck here now. Can anyone tell me what's wrong with the following code, what causes the exception, and how do I fix it? Thank you very much in advance. spark-submit --jars

PySpark + virtualenv: Using a different python path on the driver and on the executors

2017-02-25 Thread Tomer Benyamini
Hello, I'm trying to run pyspark using the following setup: - spark 1.6.1 standalone cluster on ec2 - virtualenv installed on master - app is run using the following command: export PYSPARK_DRIVER_PYTHON=/path_to_virtualenv/bin/python export PYSPARK_PYTHON=/usr/bin/python
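The setup in the message amounts to pointing the driver at the virtualenv interpreter and the executors at the system Python; the virtualenv path is the placeholder from the message:

```shell
# Driver runs the virtualenv's interpreter; executors use the system Python.
# Both paths are placeholders from the original message.
export PYSPARK_DRIVER_PYTHON=/path_to_virtualenv/bin/python
export PYSPARK_PYTHON=/usr/bin/python
# Then submit as usual, e.g.: $SPARK_HOME/bin/spark-submit app.py
echo "$PYSPARK_DRIVER_PYTHON $PYSPARK_PYTHON"
```

For this to work, both interpreters must be the same major/minor Python version, otherwise the workers will refuse to deserialize driver-side objects.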

Spark runs out of memory with small file

2017-02-25 Thread Henry Tremblay
I am reading in a single small file from Hadoop with wholeTextFiles. If I process each line and create a row with two cells, the first cell equal to the name of the file and the second cell equal to the line, that code runs fine. But if I just add two lines of code and change the first cell based on

Re: Get S3 Parquet File

2017-02-25 Thread Steve Loughran
On 24 Feb 2017, at 07:47, Femi Anthony wrote: Have you tried reading using s3n, which is a slightly older protocol? I'm not sure how compatible s3a is with older versions of Spark. I would absolutely not use s3n with a 1.2 GB file. There is a

Re: RDD blocks on Spark Driver

2017-02-25 Thread liangyhg...@gmail.com
Hi, I think you are using the local mode of Spark. There are mainly four modes, which are local, standalone, yarn and Mesos. Also, "blocks" is relative to HDFS, "partitions" is relative to Spark. liangyihuai ---Original--- From: "Jacek Laskowski" Date: 2017/2/25 02:45:20 To:
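The four modes mentioned map to different --master URLs at submit time; the hostnames, ports, and app.py below are placeholders, not values from the thread:

```shell
# One submit command per deploy mode; hosts and app.py are placeholders.
MODES="local[*] spark://master-host:7077 yarn mesos://mesos-host:5050"
for MASTER in $MODES; do
  echo "spark-submit --master $MASTER app.py"
done
```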

instrumenting Spark hit ratios

2017-02-25 Thread Mich Talebzadeh
One of the ways of ingesting data into HDFS is to use a Spark JDBC connection to connect to a source and ingest data into the underlying files or Hive tables. One question that has come up is: under controlled test conditions, what would the measurements of IO, CPU etc. be across the cluster? Assuming not