Loading JSON Dataset fails with com.fasterxml.jackson.databind.JsonMappingException

2014-11-30 Thread Peter Vandenabeele
Hi,

On spark 1.1.0 in Standalone mode, I am following


https://spark.apache.org/docs/1.1.0/sql-programming-guide.html#json-datasets

to try to load a simple test JSON file (on my local filesystem, not in
hdfs).
The file is below and was validated with jsonlint.com:

➜  tmp  cat test_4.json
{"foo":
[{
"bar": {
"id": 31,
"name": "bar"
}
},{
"tux": {
"id": 42,
"name": "tux"
}
}]
}
➜  tmp  wc test_4.json
  13  19 182 test_4.json


Reading the file as text works correctly (reporting a line count of 13).

However, trying to read the file with:

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext =
org.apache.spark.sql.SQLContext@2c4eae94

scala> val test_as_json = sqlContext.jsonFile(test_path)
...
runs into this exception:

14/11/30 12:37:10 INFO rdd.HadoopRDD: Input split:
file:/Users/peter_v/tmp/test_4.json:0+91
14/11/30 12:37:10 INFO rdd.HadoopRDD: Input split:
file:/Users/peter_v/tmp/test_4.json:91+91
14/11/30 12:37:10 ERROR executor.Executor: Exception in task 1.0 in stage
1.0 (TID 3)
com.fasterxml.jackson.databind.JsonMappingException: Can not instantiate
value of type [map type; class java.util.LinkedHashMap, [simple type, class
java.lang.Object] - [simple type, class java.lang.Object]] from String
value; no single-String constructor/factory method
...

What looks strange to me is that the file of 182 characters seems to be
split over 2 workers that take chars 0+91 and 91+91? (Is that
interpretation correct? That would yield 2 half JSON files that would each
be incomplete?) I presume I am wrong here and something else is at play.

Full log of the experiment below.

Also, I did see this thread (regarding blank lines that trigger a similar
problem):

  http://find.searchhub.org/document/9aaf462d6bca027c#f294a1dd16169ba4

I validated that I have no blank lines in the input (line count = 13), and
I also tried the filter function that is suggested there, but I still get
(presumably) the same error condition.

I also did not find immediate hints that this was a resolved issue in Spark
1.1.1

  http://spark.apache.org/releases/spark-release-1-1-1.html

Thanks for any hints how to resolve this,

Peter

+++


$ bin/spark-shell
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
MaxPermSize=128m; support was removed in 8.0
14/11/30 12:34:34 INFO spark.SecurityManager: Changing view acls to:
peter_v,
14/11/30 12:34:34 INFO spark.SecurityManager: Changing modify acls to:
peter_v,
14/11/30 12:34:34 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(peter_v, ); users with modify permissions: Set(peter_v, )
14/11/30 12:34:34 INFO spark.HttpServer: Starting HTTP Server
14/11/30 12:34:34 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/11/30 12:34:34 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:63500
14/11/30 12:34:34 INFO util.Utils: Successfully started service 'HTTP class
server' on port 63500.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_25)
Type in expressions to have them evaluated.
Type :help for more information.
14/11/30 12:34:37 INFO spark.SecurityManager: Changing view acls to:
peter_v,
14/11/30 12:34:37 INFO spark.SecurityManager: Changing modify acls to:
peter_v,
14/11/30 12:34:37 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(peter_v, ); users with modify permissions: Set(peter_v, )
14/11/30 12:34:37 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/11/30 12:34:37 INFO Remoting: Starting remoting
14/11/30 12:34:37 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriver@192.168.0.191:63503]
14/11/30 12:34:37 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://sparkDriver@192.168.0.191:63503]
14/11/30 12:34:37 INFO util.Utils: Successfully started service
'sparkDriver' on port 63503.
14/11/30 12:34:37 INFO spark.SparkEnv: Registering MapOutputTracker
14/11/30 12:34:37 INFO spark.SparkEnv: Registering BlockManagerMaster
14/11/30 12:34:37 INFO storage.DiskBlockManager: Created local directory at
/var/folders/1q/3_rsfwqd4b93sj7m6rnbzj8hgn/T/spark-local-20141130123437-43b2
14/11/30 12:34:37 INFO util.Utils: Successfully started service 'Connection
manager for block manager' on port 63504.
14/11/30 12:34:37 INFO network.ConnectionManager: Bound socket to port
63504 with id = ConnectionManagerId(192.168.0.191,63504)
14/11/30 12:34:37 INFO storage.MemoryStore: MemoryStore started with
capacity 265.1 MB
14/11/30 12:34:37 INFO storage.BlockManagerMaster: Trying to register
BlockManager
14/11/30 12:34:37 INFO storage.BlockManagerMasterActor: Registering block
manager 

kafka pipeline exactly once semantics

2014-11-30 Thread Josh J
Hi,

In the spark docs
http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node
it mentions: "However, output operations (like foreachRDD) have *at-least
once* semantics, that is, the transformed data may get written to an
external entity more than once in the event of a worker failure."

I would like to setup a Kafka pipeline whereby I write my data to a single
topic 1, then I continue to process using spark streaming and write the
transformed results to topic2, and finally I read the results from topic 2.
How do I configure the spark streaming so that I can maintain exactly once
semantics when writing to topic 2?

Thanks,
Josh


Re: Publishing a transformed DStream to Kafka

2014-11-30 Thread Josh J
Is there a way to do this that preserves exactly once semantics for the
write to Kafka?

On Tue, Sep 2, 2014 at 12:30 PM, Tim Smith secs...@gmail.com wrote:

 I'd be interested in finding the answer too. Right now, I do:

 val kafkaOutMsgs = kafkInMessages.map(x => myFunc(x._2, someParam))
 kafkaOutMsgs.foreachRDD((rdd, time) => { rdd.foreach(rec => {
 writer.output(rec) }) } ) // where writer.output is a method that takes a
 string and writer is an instance of a producer class.





 On Tue, Sep 2, 2014 at 10:12 AM, Massimiliano Tomassi 
 max.toma...@gmail.com wrote:

 Hello all,
 after having applied several transformations to a DStream I'd like to
 publish all the elements in all the resulting RDDs to Kafka. What the best
 way to do that would be? Just using DStream.foreach and then RDD.foreach ?
 Is there any other built in utility for this use case?

 Thanks a lot,
 Max

 --
 
 Massimiliano Tomassi
 
 e-mail: max.toma...@gmail.com
 





Re: Loading JSON Dataset fails with com.fasterxml.jackson.databind.JsonMappingException

2014-11-30 Thread Peter Vandenabeele
On Sun, Nov 30, 2014 at 1:10 PM, Peter Vandenabeele pe...@vandenabeele.com
wrote:

 On spark 1.1.0 in Standalone mode, I am following


 https://spark.apache.org/docs/1.1.0/sql-programming-guide.html#json-datasets

 to try to load a simple test JSON file (on my local filesystem, not in
 hdfs).
 The file is below and was validated with jsonlint.com:

 ➜  tmp  cat test_4.json
 {"foo":
 [{
 "bar": {
 "id": 31,
 "name": "bar"
 }
 },{
 "tux": {
 "id": 42,
 "name": "tux"
 }
 }]
 }
 ➜  tmp  wc test_4.json
   13  19 182 test_4.json


I should have read the manual better (#rtfm):

  jsonFile - loads data from a directory of JSON files where _each line of
the files is a JSON object_.  (emphasis mine)

So, what works is:

$ cat test_6.json   # ==> Success (count() = 3)
{"foo": "bar"}
{"foo": "tux"}
{"foo": "ping"}

and what fails is:

$ cat test_7.json   # ==> Fail (JsonMappingException)
[
  {"foo": "bar"},
  {"foo": "tux"},
  {"foo": "ping"}
]


I got confused by the fact that test_6.json is _not_ valid JSON (but works
for this), while test_7.json is a _valid_ JSON array (and does not work for
this).
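
For reference, a minimal sketch of the per-line contract in the 1.1 shell;
jsonRDD is the companion API that takes an RDD[String] with one JSON object
per element (the literal objects and path below are made up):

import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)

// jsonFile: one JSON object per line, as in test_6.json
val fromFile = sqlContext.jsonFile("/Users/peter_v/tmp/test_6.json")
fromFile.count()  // expected: 3

// jsonRDD: the same contract, but from an RDD[String] built in code
val objects = sc.parallelize(Seq(
  """{"foo": "bar"}""",
  """{"foo": "tux"}""",
  """{"foo": "ping"}"""))
sqlContext.jsonRDD(objects).count()  // expected: 3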


I will see if I can contribute some note in the documentation.

In any case, it might be better to not _name_ the file in the tutorial

examples/src/main/resources/people.json

because it is actually not a valid JSON file (but a file with a JSON object
on each line)

Maybe

examples/src/main/resources/people.jsons

is a better name (equivalent to 'xs' that is used as convention in Scala).

Also, just showing an example of people.jsons could avoid future confusion.

Thanks,

Peter

-- 
Peter Vandenabeele
http://www.allthingsdata.io


Setting network variables in spark-shell

2014-11-30 Thread Brian Dolan
Howdy Folks,

What is the correct syntax in 1.0.0 to set networking variables in spark shell? 
 Specifically, I'd like to set the spark.akka.frameSize

I'm attempting this:
spark-shell -Dspark.akka.frameSize=1 --executor-memory 4g

Only to get this within the session:

System.getProperty("spark.executor.memory")
res0: String = 4g
System.getProperty("spark.akka.frameSize")
res1: String = null

I don't believe I am violating protocol, but I have also posted this to SO: 
http://stackoverflow.com/questions/27215288/how-do-i-set-spark-akka-framesize-in-spark-shell

~~
May All Your Sequences Converge





Re: Setting network variables in spark-shell

2014-11-30 Thread Ritesh Kumar Singh
Spark configuration settings can be found here
http://spark.apache.org/docs/latest/configuration.html

Hope it helps :)

On Sun, Nov 30, 2014 at 9:55 PM, Brian Dolan buddha_...@yahoo.com.invalid
wrote:

 Howdy Folks,

 What is the correct syntax in 1.0.0 to set networking variables in spark
 shell?  Specifically, I'd like to set the spark.akka.frameSize

 I'm attempting this:

 spark-shell -Dspark.akka.frameSize=1 --executor-memory 4g


 Only to get this within the session:

 System.getProperty("spark.executor.memory")
 res0: String = 4g
 System.getProperty("spark.akka.frameSize")
 res1: String = null


 I don't believe I am violating protocol, but I have also posted this to
 SO:
 http://stackoverflow.com/questions/27215288/how-do-i-set-spark-akka-framesize-in-spark-shell

 ~~
 May All Your Sequences Converge






Re: Setting network variables in spark-shell

2014-11-30 Thread Yanbo

Try to use spark-shell --conf spark.akka.frameSize=1
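
For a standalone application (rather than the shell), a minimal sketch of the
equivalent programmatic setting, assuming the standard SparkConf API (the app
name and value below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// The frame size value is interpreted in MB
val conf = new SparkConf()
  .setAppName("frame-size-example")
  .set("spark.akka.frameSize", "100")
val sc = new SparkContext(conf)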

 On Dec 1, 2014, at 12:25 AM, Brian Dolan buddha_...@yahoo.com.INVALID wrote:
 
 Howdy Folks,
 
 What is the correct syntax in 1.0.0 to set networking variables in spark 
 shell?  Specifically, I'd like to set the spark.akka.frameSize
 
 I'm attempting this:
 spark-shell -Dspark.akka.frameSize=1 --executor-memory 4g
 
 Only to get this within the session:
 
 System.getProperty("spark.executor.memory")
 res0: String = 4g
 System.getProperty("spark.akka.frameSize")
 res1: String = null
 
 I don't believe I am violating protocol, but I have also posted this to SO: 
 http://stackoverflow.com/questions/27215288/how-do-i-set-spark-akka-framesize-in-spark-shell
 
 ~~
 May All Your Sequences Converge
 
 
 


Is there any Spark implementation for Item-based Collaborative Filtering?

2014-11-30 Thread shahab
Hi,

I just wonder if there is any implementation for Item-based Collaborative
Filtering in Spark?

best,
/Shahab


Re: Is there any Spark implementation for Item-based Collaborative Filtering?

2014-11-30 Thread Jimmy
The latest version of MLlib has it built in, no?
J

Sent from my iPhone

 On Nov 30, 2014, at 9:36 AM, shahab shahab.mok...@gmail.com wrote:
 
 Hi,
 
 I just wonder if there is any implementation for Item-based Collaborative 
 Filtering in Spark?
 
 best,
 /Shahab

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-30 Thread David Blewett
You might be interested in the new s3a filesystem in Hadoop 2.6.0 [1].

1. https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-10400
On Nov 26, 2014 12:24 PM, Aaron Davidson ilike...@gmail.com wrote:

 Spark has a known problem where it will do a pass of metadata on a large
 number of small files serially, in order to find the partition information
 prior to starting the job. This will probably not be repaired by switching
 the FS impl.

 However, you can change the FS being used like so (prior to the first
 usage):
 sc.hadoopConfiguration.set("fs.s3n.impl",
 "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

 On Wed, Nov 26, 2014 at 1:47 AM, Tomer Benyamini tomer@gmail.com
 wrote:

 Thanks Lalit; Setting the access + secret keys in the configuration works
 even when calling sc.textFile. Is there a way to select which hadoop s3
 native filesystem implementation would be used at runtime using the hadoop
 configuration?

 Thanks,
 Tomer

 On Wed, Nov 26, 2014 at 11:08 AM, lalit1303 la...@sigmoidanalytics.com
 wrote:


 You can try creating a Hadoop Configuration and setting the S3
 configuration, i.e. the access keys etc. Then, for reading files from S3,
 use newAPIHadoopFile and pass the config object along with the key and
 value classes.
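
A rough, untested sketch of that suggestion, assuming the usual s3n Hadoop
property names (the bucket, path and keys are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Copy the existing Hadoop configuration and add S3 credentials
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// Read text files via the new Hadoop API, passing the config explicitly
val lines = sc.newAPIHadoopFile(
  "s3n://your-bucket/path/part-*",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
  hadoopConf
).map(_._2.toString)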





 -
 Lalit Yadav
 la...@sigmoidanalytics.com
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/S3NativeFileSystem-inefficient-implementation-when-calling-sc-textFile-tp19841p19845.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-30 Thread Aaron Davidson
Note that it does not appear that s3a solves the original problems in this
thread, which are on the Spark side or due to the fact that metadata
listing in S3 is slow simply due to going over the network.

On Sun, Nov 30, 2014 at 10:07 AM, David Blewett da...@dawninglight.net
wrote:

 You might be interested in the new s3a filesystem in Hadoop 2.6.0 [1].

 1.
 https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-10400
 On Nov 26, 2014 12:24 PM, Aaron Davidson ilike...@gmail.com wrote:

 Spark has a known problem where it will do a pass of metadata on a large
 number of small files serially, in order to find the partition information
 prior to starting the job. This will probably not be repaired by switching
 the FS impl.

 However, you can change the FS being used like so (prior to the first
 usage):
 sc.hadoopConfiguration.set("fs.s3n.impl",
 "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

 On Wed, Nov 26, 2014 at 1:47 AM, Tomer Benyamini tomer@gmail.com
 wrote:

 Thanks Lalit; Setting the access + secret keys in the configuration
 works even when calling sc.textFile. Is there a way to select which hadoop
 s3 native filesystem implementation would be used at runtime using the
 hadoop configuration?

 Thanks,
 Tomer

 On Wed, Nov 26, 2014 at 11:08 AM, lalit1303 la...@sigmoidanalytics.com
 wrote:


 you can try creating hadoop Configuration and set s3 configuration i.e.
 access keys etc.
 Now, for reading files from s3 use newAPIHadoopFile and pass the config
 object here along with key, value classes.





 -
 Lalit Yadav
 la...@sigmoidanalytics.com
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/S3NativeFileSystem-inefficient-implementation-when-calling-sc-textFile-tp19841p19845.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Is there any Spark implementation for Item-based Collaborative Filtering?

2014-11-30 Thread Sean Owen
There is an implementation of all-pairs similarity. Have a look at the
DIMSUM implementation in RowMatrix. It is an element of what you would
need for such a recommender, but not the whole thing.

You can also do the model-building part of an ALS-based recommender
with ALS in MLlib.

So, no, not directly, but there are related pieces.
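
To make those pieces concrete, a minimal sketch assuming a Spark build whose
RowMatrix includes the DIMSUM columnSimilarities method (the tiny vectors and
ratings are invented just to show the call shapes):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// All-pairs similarity between item columns (DIMSUM)
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 3.0),
  Vectors.dense(0.0, 2.0, 1.0)))
val itemSimilarities = new RowMatrix(rows).columnSimilarities(0.1)

// Model-building half of an ALS-based recommender
val ratings = sc.parallelize(Seq(Rating(1, 10, 4.0), Rating(2, 10, 5.0)))
val model = ALS.train(ratings, 10, 5)  // rank = 10, iterations = 5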

On Sun, Nov 30, 2014 at 5:36 PM, shahab shahab.mok...@gmail.com wrote:
 Hi,

 I just wonder if there is any implementation for Item-based Collaborative
 Filtering in Spark?

 best,
 /Shahab

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is there any Spark implementation for Item-based Collaborative Filtering?

2014-11-30 Thread Pat Ferrel
Actually the spark-itemsimilarity job and related code in the Spark module of 
Mahout create all-pairs similarity too. It's designed to be used with a search 
engine, which provides the query part of the recommender. Integrate the two and 
you have a near-realtime, scalable, item-based/cooccurrence collaborative 
filtering type recommender.


On Nov 30, 2014, at 12:09 PM, Sean Owen so...@cloudera.com wrote:

There is an implementation of all-pairs similarity. Have a look at the
DIMSUM implementation in RowMatrix. It is an element of what you would
need for such a recommender, but not the whole thing.

You can also do the model-building part of an ALS-based recommender
with ALS in MLlib.

So, no not directly, but there are related pieces.

On Sun, Nov 30, 2014 at 5:36 PM, shahab shahab.mok...@gmail.com wrote:
 Hi,
 
 I just wonder if there is any implementation for Item-based Collaborative
 Filtering in Spark?
 
 best,
 /Shahab

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Publishing a transformed DStream to Kafka

2014-11-30 Thread francois . garillot
How about writing to a buffer? Then you would flush the buffer to Kafka if and 
only if the output operation reports successful completion. In the event of a 
worker failure, the flush would not happen.
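
A very rough sketch of that idea, reusing the hypothetical writer producer
wrapper from the code quoted below (note this narrows, but does not fully
close, the window for duplicate writes):

kafkaOutMsgs.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Accumulate locally first; if processing throws, nothing is sent
    val buffer = scala.collection.mutable.ArrayBuffer[String]()
    records.foreach(rec => buffer += rec)
    // Flush to Kafka only after the whole partition succeeded
    buffer.foreach(rec => writer.output(rec))
  }
}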



—
FG

On Sun, Nov 30, 2014 at 2:28 PM, Josh J joshjd...@gmail.com wrote:

 Is there a way to do this that preserves exactly once semantics for the
 write to Kafka?
 On Tue, Sep 2, 2014 at 12:30 PM, Tim Smith secs...@gmail.com wrote:
 I'd be interested in finding the answer too. Right now, I do:

 val kafkaOutMsgs = kafkInMessages.map(x => myFunc(x._2, someParam))
 kafkaOutMsgs.foreachRDD((rdd, time) => { rdd.foreach(rec => {
 writer.output(rec) }) } ) // where writer.output is a method that takes a
 string and writer is an instance of a producer class.





 On Tue, Sep 2, 2014 at 10:12 AM, Massimiliano Tomassi 
 max.toma...@gmail.com wrote:

 Hello all,
 after having applied several transformations to a DStream I'd like to
 publish all the elements in all the resulting RDDs to Kafka. What the best
 way to do that would be? Just using DStream.foreach and then RDD.foreach ?
 Is there any other built in utility for this use case?

 Thanks a lot,
 Max

 --
 
 Massimiliano Tomassi
 
 e-mail: max.toma...@gmail.com
 




Re: Multiple SparkContexts in same Driver JVM

2014-11-30 Thread Harihar Nahak
Try setting it via SparkConf.set("spark.driver.allowMultipleContexts", "true")
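
For reference, a minimal sketch of where that setting goes: it is an
application-level SparkConf entry rather than a spark-env.sh variable (the app
name below is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("second-context")
  .set("spark.driver.allowMultipleContexts", "true")
val sc2 = new SparkContext(conf)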

On 30 November 2014 at 17:37, lokeshkumar [via Apache Spark User List] 
ml-node+s1001560n20037...@n3.nabble.com wrote:

 Hi Forum,

 Is it not possible to run multiple SparkContexts concurrently without
 stopping the other one in the spark 1.3.0.
 I have been trying this out and getting the below error.

 Caused by: org.apache.spark.SparkException: Only one SparkContext may be
 running in this JVM (see SPARK-2243). To ignore this error, set
 spark.driver.allowMultipleContexts = true. The currently running
 SparkContext was created at:

 According to this, its not possible to create unless we specify the option
 spark.driver.allowMultipleContexts = true.

 So is there a way to create multiple concurrently running SparkContext in
 same JVM or should we trigger Driver processes in different JVMs to do the
 same?

 Also please let me know where the option
 'spark.driver.allowMultipleContexts' to be set? I have set it in
 spark-env.sh SPARK_MASTER_OPTS but no luck.

 --
  If you reply to this email, your message will be added to the discussion
 below:

 http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-SparkContexts-in-same-Driver-JVM-tp20037.html
  To start a new topic under Apache Spark User List, email
 ml-node+s1001560n1...@n3.nabble.com
 To unsubscribe from Apache Spark User List, click here
 http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_codenode=1code=aG5haGFrQHd5bnlhcmRncm91cC5jb218MXwtMTgxOTE5MTkyOQ==
 .
 NAML
 http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewerid=instant_html%21nabble%3Aemail.namlbase=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespacebreadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml




-- 
Regards,
Harihar Nahak
BigData Developer
Wynyard
Email:hna...@wynyardgroup.com | Extn: 8019




-
--Harihar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-SparkContexts-in-same-Driver-JVM-tp20037p20055.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: RDDs join problem: incorrect result

2014-11-30 Thread Harihar Nahak
What do you mean by incorrect? Could you please share some examples from
both input RDDs and the resultant RDD? Also, if you get any exception,
paste that too; it helps to debug where the issue is.

On 27 November 2014 at 17:07, liuboya [via Apache Spark User List] 
ml-node+s1001560n19928...@n3.nabble.com wrote:

 Hi,
I ran into a problem when doing two RDDs join operation. For example,
 RDDa: RDD[(String,String)] and RDDb:RDD[(String,Int)]. Then, the result
 RDDc:[String,(String,Int)] = RDDa.join(RDDb). But I find the results in
 RDDc are  incorrect compared with RDDb. What's wrong in join?

 --
  If you reply to this email, your message will be added to the discussion
 below:

 http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-join-problem-incorrect-result-tp19928.html
  To start a new topic under Apache Spark User List, email
 ml-node+s1001560n1...@n3.nabble.com
 To unsubscribe from Apache Spark User List, click here
 http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_codenode=1code=aG5haGFrQHd5bnlhcmRncm91cC5jb218MXwtMTgxOTE5MTkyOQ==
 .
 NAML
 http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewerid=instant_html%21nabble%3Aemail.namlbase=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespacebreadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml




-- 
Regards,
Harihar Nahak
BigData Developer
Wynyard
Email:hna...@wynyardgroup.com | Extn: 8019




-
--Harihar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-join-problem-incorrect-result-tp19928p20056.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: GraphX:java.lang.NoSuchMethodError:org.apache.spark.graphx.Graph$.apply

2014-11-30 Thread Harihar Nahak
Hi, if you haven't figured it out so far, could you please share some details
on how you are running GraphX?

Also, before executing the above commands from the shell, import the required
GraphX packages.

On 27 November 2014 at 20:49, liuboya [via Apache Spark User List] 
ml-node+s1001560n19959...@n3.nabble.com wrote:

 I'm waiting online. Who can help me, please?

 --
  If you reply to this email, your message will be added to the discussion
 below:

 http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-java-lang-NoSuchMethodError-org-apache-spark-graphx-Graph-apply-tp19958p19959.html
  To start a new topic under Apache Spark User List, email
 ml-node+s1001560n1...@n3.nabble.com
 To unsubscribe from Apache Spark User List, click here
 http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_codenode=1code=aG5haGFrQHd5bnlhcmRncm91cC5jb218MXwtMTgxOTE5MTkyOQ==
 .
 NAML
 http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewerid=instant_html%21nabble%3Aemail.namlbase=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespacebreadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml




-- 
Regards,
Harihar Nahak
BigData Developer
Wynyard
Email:hna...@wynyardgroup.com | Extn: 8019




-
--Harihar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-java-lang-NoSuchMethodError-org-apache-spark-graphx-Graph-apply-tp19958p20057.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

How can a function access Executor ID, Function ID and other parameters

2014-11-30 Thread Steve Lewis
 I am running on a 15 node cluster and am trying to set partitioning to
balance the work across all nodes. I am using an Accumulator to track work
by MAC address, but I would prefer to use data known to the Spark
environment: Executor ID and Function ID show up in the Spark UI, and Task
ID and Attempt ID (assuming these work like Hadoop) would be useful.
Does someone know how code running in a function can access these
parameters? I think I have asked this group several times about Task ID and
Attempt ID without getting a reply.

Incidentally, the data I collect suggests that my execution is not at all
balanced.
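
Not an authoritative answer, but one sketch that may help: TaskContext.get()
(available from Spark 1.2 on) exposes the stage, partition and attempt IDs
inside a closure, and SparkEnv.get.executorId currently holds the executor ID.
Both are developer-level APIs that may change, and someRdd is a placeholder:

import org.apache.spark.{SparkEnv, TaskContext}

val tagged = someRdd.mapPartitions { iter =>
  val ctx = TaskContext.get()               // per-task context
  val executorId = SparkEnv.get.executorId  // executor running this task
  iter.map(x => (executorId, ctx.stageId(), ctx.partitionId(), ctx.attemptId(), x))
}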


Re: reduceByKey and empty output files

2014-11-30 Thread Rishi Yadav
How big is your input dataset?

On Thursday, November 27, 2014, Praveen Sripati praveensrip...@gmail.com
wrote:

 Hi,

 When I run the program below, I see two files in HDFS because the number
 of partitions is 2. But one of the files is empty. Why is it so? Is the
 work not distributed equally to all the tasks?

 textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)) \
   .reduceByKey(lambda a, b: a + b).repartition(2) \
   .saveAsTextFile("hdfs://localhost:9000/user/praveen/output/")

 Thanks,
 Praveen



-- 
- Rishi


Re: Edge List File in GraphX

2014-11-30 Thread Harihar Nahak
GraphLoader.edgeListFile(sc, fileName), where the file must contain one edge
per line in "1\t2" (tab-separated vertex pair) form. About the NaN result:
there might be some issue with the data. I ran it for various combinations of
data sets and it works perfectly fine.
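
For reference, a minimal sketch of loading such a file and running PageRank
(the path is a placeholder):

import org.apache.spark.graphx.GraphLoader

// edges.txt: one "srcId<TAB>dstId" pair per line, e.g. "1\t2"
val graph = GraphLoader.edgeListFile(sc, "edges.txt")
val ranks = graph.pageRank(0.0001).vertices
ranks.take(5).foreach(println)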

On 25 November 2014 at 19:23, pradhandeep [via Apache Spark User List] 
ml-node+s1001560n1972...@n3.nabble.com wrote:

 Hi,
 Is it necessary for every vertex to have an attribute when we load a graph
 to GraphX?
 In other words, I have an edge list file containing pairs of vertices,
 i.e., "1   2" means that there is an edge between node 1 and node 2. Now,
 when I run PageRank on this data it returns NaN.
 Can I use this type of data for any algorithm on GraphX?

 Thank You


 --
  If you reply to this email, your message will be added to the discussion
 below:

 http://apache-spark-user-list.1001560.n3.nabble.com/Edge-List-File-in-GraphX-tp19724.html
  To start a new topic under Apache Spark User List, email
 ml-node+s1001560n1...@n3.nabble.com
 To unsubscribe from Apache Spark User List, click here
 http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_codenode=1code=aG5haGFrQHd5bnlhcmRncm91cC5jb218MXwtMTgxOTE5MTkyOQ==
 .
 NAML
 http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewerid=instant_html%21nabble%3Aemail.namlbase=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespacebreadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml




-- 
Regards,
Harihar Nahak
BigData Developer
Wynyard
Email:hna...@wynyardgroup.com | Extn: 8019




-
--Harihar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Edge-List-File-in-GraphX-tp19724p20060.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: kafka pipeline exactly once semantics

2014-11-30 Thread Tobias Pfeiffer
Josh,

On Sun, Nov 30, 2014 at 10:17 PM, Josh J joshjd...@gmail.com wrote:

 I would like to setup a Kafka pipeline whereby I write my data to a single
 topic 1, then I continue to process using spark streaming and write the
 transformed results to topic2, and finally I read the results from topic 2.


Not really related to your question, but you may also want to look into
Samza http://samza.incubator.apache.org/ which was built exactly for this
kind of processing.

Tobias


RE: Unable to compile spark 1.1.0 on windows 8.1

2014-11-30 Thread Judy Nash
I have found the following to work for me on Win 8.1:
1) Run sbt assembly.
2) Use Maven. You can find the Maven commands for your build in
docs\building-spark.md


-Original Message-
From: Ishwardeep Singh [mailto:ishwardeep.si...@impetus.co.in] 
Sent: Thursday, November 27, 2014 11:31 PM
To: u...@spark.incubator.apache.org
Subject: Unable to compile spark 1.1.0 on windows 8.1

Hi,

I am trying to compile spark 1.1.0 on windows 8.1 but I get the following 
exception. 

[info] Compiling 3 Scala sources to
D:\myworkplace\software\spark-1.1.0\project\target\scala-2.10\sbt0.13\classes...
[error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:26:
object sbt is not a member of package com.typesafe [error] import 
com.typesafe.sbt.pom.{PomBuild, SbtPomKeys}
[error] ^
[error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:53: not
found: type PomBuild
[error] object SparkBuild extends PomBuild {
[error]   ^
[error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:121:
not found: value SbtPomKeys
[error] otherResolvers = SbtPomKeys.mvnLocalRepository(dotM2 =
Seq(Resolver.file(dotM2, dotM2))),
[error]^
[error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:165:
value projectDefinitions is not a member of AnyRef
[error] super.projectDefinitions(baseDirectory).map { x =
[error]   ^
[error] four errors found
[error] (plugins/compile:compile) Compilation failed

I have also setup scala 2.10.

Need help to resolve this issue.

Regards,
Ishwardeep 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-compile-spark-1-1-0-on-windows-8-1-tp19996.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark.akka.frameSize setting problem

2014-11-30 Thread Ke Wang
I met the same problem; did you solve it?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-akka-frameSize-setting-problem-tp3416p20063.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark.akka.frameSize setting problem

2014-11-30 Thread Shixiong Zhu
4096 MB, once converted to bytes, is greater than Int.MaxValue, so it will
overflow in Spark. Please set it to less than 4096.

Best Regards,
Shixiong Zhu

2014-12-01 13:14 GMT+08:00 Ke Wang jkx...@gmail.com:

 I meet the same problem, did you solve it ?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-akka-frameSize-setting-problem-tp3416p20063.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: spark.akka.frameSize setting problem

2014-11-30 Thread Shixiong Zhu
Sorry, it should be less than 2048; 2047 is the greatest valid value.
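
Presumably because the MB value is multiplied out to bytes as an Int; a quick
check of the arithmetic:

// Int.MaxValue = 2147483647
// 2047 * 1024 * 1024 = 2146435072  -> fits in an Int
// 2048 * 1024 * 1024 = 2147483648  -> overflows Int.MaxValue
val maxFrameSizeMb = Int.MaxValue / (1024 * 1024)  // = 2047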

Best Regards,
Shixiong Zhu

2014-12-01 13:20 GMT+08:00 Shixiong Zhu zsxw...@gmail.com:

 4096MB is greater than Int.MaxValue and it will be overflow in Spark.
 Please set it less then 4096.

 Best Regards,
 Shixiong Zhu

 2014-12-01 13:14 GMT+08:00 Ke Wang jkx...@gmail.com:

 I meet the same problem, did you solve it ?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-akka-frameSize-setting-problem-tp3416p20063.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: spark.akka.frameSize setting problem

2014-11-30 Thread Shixiong Zhu
Created a JIRA to track it: https://issues.apache.org/jira/browse/SPARK-4664

Best Regards,
Shixiong Zhu

2014-12-01 13:22 GMT+08:00 Shixiong Zhu zsxw...@gmail.com:

 Sorry. Should be not greater than 2048. 2047 is the greatest value.

 Best Regards,
 Shixiong Zhu

 2014-12-01 13:20 GMT+08:00 Shixiong Zhu zsxw...@gmail.com:

 4096MB is greater than Int.MaxValue and it will be overflow in Spark.
 Please set it less then 4096.

 Best Regards,
 Shixiong Zhu

 2014-12-01 13:14 GMT+08:00 Ke Wang jkx...@gmail.com:

 I meet the same problem, did you solve it ?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-akka-frameSize-setting-problem-tp3416p20063.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-30 Thread Judy Nash
Thanks Patrick and Cheng for the suggestions.

The issue was that the Hadoop common jar was added to the classpath. After I 
removed the Hadoop common jar from both master and slave, I was able to bypass 
the error. This was caused by a local change, so there is no impact on the 1.2 
release. 
-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com] 
Sent: Wednesday, November 26, 2014 8:17 AM
To: Judy Nash
Cc: Denny Lee; Cheng Lian; u...@spark.incubator.apache.org
Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on 
Guava

Just to double check - I looked at our own assembly jar and I confirmed that 
our Hadoop configuration class does use the correctly shaded version of Guava. 
My best guess here is that somehow a separate Hadoop library is ending up on 
the classpath, possibly because Spark put it there somehow.

 tar xvzf spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar
 cd org/apache/hadoop/
 javap -v Configuration | grep Precond

Warning: Binary file Configuration contains org.apache.hadoop.conf.Configuration

   #497 = Utf8   org/spark-project/guava/common/base/Preconditions

   #498 = Class  #497 //
org/spark-project/guava/common/base/Preconditions

   #502 = Methodref  #498.#501//
org/spark-project/guava/common/base/Preconditions.checkArgument:(ZLjava/lang/Object;)V

12: invokestatic  #502// Method
org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLjava/lang/Object;)V

50: invokestatic  #502// Method
org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLjava/lang/Object;)V

On Wed, Nov 26, 2014 at 11:08 AM, Patrick Wendell pwend...@gmail.com wrote:
 Hi Judy,

 Are you somehow modifying Spark's classpath to include jars from 
 Hadoop and Hive that you have running on the machine? The issue seems 
 to be that you are somehow including a version of Hadoop that 
 references the original guava package. The Hadoop that is bundled in 
 the Spark jars should not do this.

 - Patrick

 On Wed, Nov 26, 2014 at 1:45 AM, Judy Nash 
 judyn...@exchange.microsoft.com wrote:
 Looks like a config issue. I ran spark-pi job and still failing with 
 the same guava error

 Command ran:

 .\bin\spark-class.cmd org.apache.spark.deploy.SparkSubmit --class 
 org.apache.spark.examples.SparkPi --master spark://headnodehost:7077 
 --executor-memory 1G --num-executors 1 
 .\lib\spark-examples-1.2.1-SNAPSHOT-hadoop2.4.0.jar 100



 Had used the same build steps on spark 1.1 and had no issue.



 From: Denny Lee [mailto:denny.g@gmail.com]
 Sent: Tuesday, November 25, 2014 5:47 PM
 To: Judy Nash; Cheng Lian; u...@spark.incubator.apache.org


 Subject: Re: latest Spark 1.2 thrift server fail with 
 NoClassDefFoundError on Guava



 To determine if this is a Windows vs. other configuration, can you 
 just try to call the Spark-class.cmd SparkSubmit without actually 
 referencing the Hadoop or Thrift server classes?





 On Tue Nov 25 2014 at 5:42:09 PM Judy Nash 
 judyn...@exchange.microsoft.com
 wrote:

 I traced the code and used the following to call:

 Spark-class.cmd org.apache.spark.deploy.SparkSubmit --class
 org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 
 spark-internal --hiveconf hive.server2.thrift.port=1



 The issue ended up being much more fundamental, however. Spark doesn't 
 work at all in the configuration below. When opening spark-shell, it fails 
 with the same ClassNotFound error.

 Now I wonder if this is a windows-only issue or the hive/Hadoop 
 configuration that is having this problem.



 From: Cheng Lian [mailto:lian.cs@gmail.com]
 Sent: Tuesday, November 25, 2014 1:50 AM


 To: Judy Nash; u...@spark.incubator.apache.org
 Subject: Re: latest Spark 1.2 thrift server fail with 
 NoClassDefFoundError on Guava



 Oh so you're using Windows. What command are you using to start the 
 Thrift server then?

 On 11/25/14 4:25 PM, Judy Nash wrote:

 Made progress but still blocked.

 After recompiling the code on cmd instead of PowerShell, now I can 
 see all 5 classes as you mentioned.

 However I am still seeing the same error as before. Anything else I 
 can check for?



 From: Judy Nash [mailto:judyn...@exchange.microsoft.com]
 Sent: Monday, November 24, 2014 11:50 PM
 To: Cheng Lian; u...@spark.incubator.apache.org
 Subject: RE: latest Spark 1.2 thrift server fail with 
 NoClassDefFoundError on Guava



 This is what I got from jar tf:

 org/spark-project/guava/common/base/Preconditions.class

 org/spark-project/guava/common/math/MathPreconditions.class

 com/clearspring/analytics/util/Preconditions.class

 parquet/Preconditions.class



 I seem to have the line that reported missing, but I am missing this file:

 com/google/inject/internal/util/$Preconditions.class



 Any suggestion on how to fix this?

 Very much appreciate the help as I am very new to Spark and open 
 source technologies.



 From: Cheng Lian [mailto:lian.cs@gmail.com]
 Sent: Monday, 

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-30 Thread Patrick Wendell
Thanks Judy. While this is not directly caused by a Spark issue, it is
likely other users will run into this. This is an unfortunate
consequence of the way that we've shaded Guava in this release: we
rely on bytecode shading of Hadoop itself as well, and if the user
has their own Hadoop classes present it can cause issues.

On Sun, Nov 30, 2014 at 10:53 PM, Judy Nash
judyn...@exchange.microsoft.com wrote:
 Thanks Patrick and Cheng for the suggestions.

 The issue was Hadoop common jar was added to a classpath. After I removed 
 Hadoop common jar from both master and slave, I was able to bypass the error.
 This was caused by a local change, so no impact on the 1.2 release.
 -Original Message-
 From: Patrick Wendell [mailto:pwend...@gmail.com]
 Sent: Wednesday, November 26, 2014 8:17 AM
 To: Judy Nash
 Cc: Denny Lee; Cheng Lian; u...@spark.incubator.apache.org
 Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on 
 Guava

 Just to double check - I looked at our own assembly jar and I confirmed that 
 our Hadoop configuration class does use the correctly shaded version of 
 Guava. My best guess here is that somehow a separate Hadoop library is ending 
 up on the classpath, possible because Spark put it there somehow.

 tar xvzf spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar
 cd org/apache/hadoop/
 javap -v Configuration | grep Precond

 Warning: Binary file Configuration contains 
 org.apache.hadoop.conf.Configuration

#497 = Utf8   org/spark-project/guava/common/base/Preconditions

#498 = Class  #497 //
 org/spark-project/guava/common/base/Preconditions

#502 = Methodref  #498.#501//
 org/spark-project/guava/common/base/Preconditions.checkArgument:(ZLjava/lang/Object;)V

 12: invokestatic  #502// Method
 org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLjava/lang/Object;)V

 50: invokestatic  #502// Method
 org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLjava/lang/Object;)V

 On Wed, Nov 26, 2014 at 11:08 AM, Patrick Wendell pwend...@gmail.com wrote:
 Hi Judy,

 Are you somehow modifying Spark's classpath to include jars from
 Hadoop and Hive that you have running on the machine? The issue seems
 to be that you are somehow including a version of Hadoop that
 references the original guava package. The Hadoop that is bundled in
 the Spark jars should not do this.

 - Patrick

 On Wed, Nov 26, 2014 at 1:45 AM, Judy Nash
 judyn...@exchange.microsoft.com wrote:
 Looks like a config issue. I ran spark-pi job and still failing with
 the same guava error

 Command ran:

 .\bin\spark-class.cmd org.apache.spark.deploy.SparkSubmit --class
 org.apache.spark.examples.SparkPi --master spark://headnodehost:7077
 --executor-memory 1G --num-executors 1
 .\lib\spark-examples-1.2.1-SNAPSHOT-hadoop2.4.0.jar 100



 Had used the same build steps on spark 1.1 and had no issue.



 From: Denny Lee [mailto:denny.g@gmail.com]
 Sent: Tuesday, November 25, 2014 5:47 PM
 To: Judy Nash; Cheng Lian; u...@spark.incubator.apache.org


 Subject: Re: latest Spark 1.2 thrift server fail with
 NoClassDefFoundError on Guava



 To determine if this is a Windows vs. other configuration, can you
 just try to call the Spark-class.cmd SparkSubmit without actually
 referencing the Hadoop or Thrift server classes?





 On Tue Nov 25 2014 at 5:42:09 PM Judy Nash
 judyn...@exchange.microsoft.com
 wrote:

 I traced the code and used the following to call:

 Spark-class.cmd org.apache.spark.deploy.SparkSubmit --class
 org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
 spark-internal --hiveconf hive.server2.thrift.port=1



 The issue ended up to be much more fundamental however. Spark doesn't
 work at all in configuration below. When open spark-shell, it fails
 with the same ClassNotFound error.

 Now I wonder if this is a windows-only issue or the hive/Hadoop
 configuration that is having this problem.



 From: Cheng Lian [mailto:lian.cs@gmail.com]
 Sent: Tuesday, November 25, 2014 1:50 AM


 To: Judy Nash; u...@spark.incubator.apache.org
 Subject: Re: latest Spark 1.2 thrift server fail with
 NoClassDefFoundError on Guava



 Oh so you're using Windows. What command are you using to start the
 Thrift server then?

 On 11/25/14 4:25 PM, Judy Nash wrote:

 Made progress but still blocked.

 After recompiling the code on cmd instead of PowerShell, now I can
 see all 5 classes as you mentioned.

 However I am still seeing the same error as before. Anything else I
 can check for?



 From: Judy Nash [mailto:judyn...@exchange.microsoft.com]
 Sent: Monday, November 24, 2014 11:50 PM
 To: Cheng Lian; u...@spark.incubator.apache.org
 Subject: RE: latest Spark 1.2 thrift server fail with
 NoClassDefFoundError on Guava



 This is what I got from jar tf:

 org/spark-project/guava/common/base/Preconditions.class

 org/spark-project/guava/common/math/MathPreconditions.class