Re: PySpark: slicing issue with dataframes
Friendly reminder on this one. I just wanted to confirm that this is not by design before I log a JIRA. Thanks!

Ali

On Tue, Apr 28, 2015 at 9:53 AM, Ali Bajwa ali.ba...@gmail.com wrote:

Hi experts,

Trying to use string slicing as part of a Spark program (PySpark), I get this error:

Code:

    import pandas as pd
    from pyspark.sql import SQLContext

    hc = SQLContext(sc)
    A = pd.DataFrame({'Firstname': ['James', 'Ali', 'Daniel'],
                      'Lastname': ['Jones', 'Bajwa', 'Day']})
    a = hc.createDataFrame(A)
    print A
    b = a.select(a.Firstname[:2])
    print b.toPandas()
    c = a.select(a.Lastname[2:])
    print c.toPandas()

Output:

      Firstname Lastname
    0     James    Jones
    1       Ali    Bajwa
    2    Daniel      Day

      SUBSTR(Firstname, 0, 2)
    0                      Ja
    1                      Al
    2                      Da

    ---------------------------------------------------------------------------
    Py4JError                                 Traceback (most recent call last)
    <ipython-input-17-6ee5d7d069ce> in <module>()
         10 b = a.select(a.Firstname[:2])
         11 print b.toPandas()
    ---> 12 c = a.select(a.Lastname[2:])
         13 print c.toPandas()

    /home/jupyter/spark-1.3.1/python/pyspark/sql/dataframe.pyc in substr(self, startPos, length)
       1089             raise TypeError("Can not mix the type")
       1090         if isinstance(startPos, (int, long)):
    -> 1091             jc = self._jc.substr(startPos, length)
       1092         elif isinstance(startPos, Column):
       1093             jc = self._jc.substr(startPos._jc, length._jc)

    /home/jupyter/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
        536         answer = self.gateway_client.send_command(command)
        537         return_value = get_return_value(answer, self.gateway_client,
    --> 538                                         self.target_id, self.name)
        539
        540         for temp_arg in temp_args:

    /home/jupyter/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
        302                 raise Py4JError(
        303                     'An error occurred while calling {0}{1}{2}. Trace:\n{3}\n'.
    --> 304                     format(target_id, '.', name, value))
        305             else:
        306                 raise Py4JError(

    Py4JError: An error occurred while calling o1887.substr.
    Trace:
    py4j.Py4JException: Method substr([class java.lang.Integer, class java.lang.Long]) does not exist
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
        at py4j.Gateway.invoke(Gateway.java:252)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)

It looks like X[:2] works but X[2:] fails with the error above. Is anyone else seeing this issue? I can work around it with substr(), but if this is a confirmed bug we should open a JIRA.

Thanks,
Ali
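Until the bug is confirmed, the substr() workaround mentioned above is the practical fix. As a way to see why X[2:] is the hard case, here is a pure-Python sketch of the slice-to-substr translation; slice_to_substr and the string-level substr below are hypothetical helpers for illustration, not PySpark API:

```python
# Translating a Python slice into SQL-style 1-based substr(start, length)
# arguments. An open-ended slice like X[2:] has stop=None, so a concrete
# length must be invented for it -- picking an oversized sentinel value is
# the kind of thing that produces the Integer/Long mismatch reported above.
def slice_to_substr(sl):
    if sl.step is not None:
        raise ValueError("substr cannot express a slice step")
    start = sl.start or 0
    if sl.stop is None:
        raise ValueError("open-ended slice: call substr() with an explicit length")
    return (start + 1, sl.stop - start)  # SQL substr positions are 1-based

def substr(s, start, length):
    # plain-string stand-in for a column-level SUBSTR
    return s[start - 1:start - 1 + length]

start, length = slice_to_substr(slice(None, 2))  # the X[:2] case, which works
print(substr("Bajwa", start, length))            # prints "Ba"
```

The sketch simply refuses the open-ended case; explicitly passing a start and length to substr() sidesteps the problem the same way.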
Re: Question regarding join with multiple columns with pyspark
Thanks again Ayan! To close the loop on this issue, I have filed the JIRA below to track it:
https://issues.apache.org/jira/browse/SPARK-7197

On Fri, Apr 24, 2015 at 8:21 PM, ayan guha guha.a...@gmail.com wrote:

I just tested, and your observation about the DataFrame API is correct: it behaves weirdly in the case of a multiple-column join. (Maybe we should report a JIRA?)

Solution: you can go back to our good old composite-key field-concatenation method. Not ideal, but a workaround. (Of course you can use real SQL as well, as shown below.)

Set up the data:

    a = [[1993,1,100],[1993,2,200],[1994,1,1000],[1994,3,3000],[2000,1,1]]
    b = [[1993,1,"A"],[1994,1,"AA"],[2000,1,"AAA"]]
    YM1 = sc.parallelize(a).map(lambda tup: Row(yr=int(tup[0]), mn=int(tup[1]),
                                                price=int(tup[2]),
                                                joiningKey=str(tup[0])+"~"+str(tup[1])))
    YM2 = sc.parallelize(b).map(lambda tup: Row(yr=int(tup[0]), mn=int(tup[1]),
                                                name=str(tup[2]),
                                                joiningKey=str(tup[0])+"~"+str(tup[1])))
    print YM1.collect()
    print YM2.collect()
    YM1DF = ssc.createDataFrame(YM1)
    YM2DF = ssc.createDataFrame(YM2)
    print YM1DF.printSchema()
    print YM2DF.printSchema()

This DOES NOT WORK:

    YMJN = YM1DF.join(YM2DF, YM1DF.yr==YM2DF.yr and YM1DF.mn==YM2DF.mn, "inner")
    print YMJN.printSchema()
    for l in YMJN.collect():
        print l

    Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
    Row(joiningKey=u'1994~1', mn=1, price=100, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
    Row(joiningKey=u'2000~1', mn=1, price=100, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
    Row(joiningKey=u'1993~1', mn=1, price=1000, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
    Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
    Row(joiningKey=u'2000~1', mn=1, price=1000, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
    Row(joiningKey=u'1993~1', mn=1, price=1, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
    Row(joiningKey=u'1994~1', mn=1, price=1, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
    Row(joiningKey=u'2000~1', mn=1, price=1, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)

SQL solution, works as expected:

    YM1DF.registerTempTable("ymdf1")
    YM2DF.registerTempTable("ymdf2")
    YMJNS = ssc.sql("select * from ymdf1 inner join ymdf2 on ymdf1.yr=ymdf2.yr and ymdf1.mn=ymdf2.mn")
    print YMJNS.printSchema()
    for l in YMJNS.collect():
        print l

    Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
    Row(joiningKey=u'2000~1', mn=1, price=1, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
    Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)

Field-concatenation method, works as well:

    YMJNA = YM1DF.join(YM2DF, YM1DF.joiningKey==YM2DF.joiningKey, "inner")
    print YMJNA.printSchema()
    for l in YMJNA.collect():
        print l

    Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
    Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
    Row(joiningKey=u'2000~1', mn=1, price=1, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)

On Sat, Apr 25, 2015 at 10:18 AM, Ali Bajwa ali.ba...@gmail.com wrote:

Any ideas on this? Any sample code to join 2 data frames on two columns?

Thanks
Ali

On Apr 23, 2015, at 1:05 PM, Ali Bajwa ali.ba...@gmail.com wrote:

Hi experts,

Sorry if this is a n00b question or has already been answered... I am trying to use the data frames API in Python to join 2 dataframes on more than 1 column. The example I've seen in the documentation only shows a single column, so I tried this:

Example code:

    import pandas as pd
    from pyspark.sql import SQLContext

    hc = SQLContext(sc)
    A = pd.DataFrame({'year': ['1993', '2005', '1994'],
                      'month': ['5', '12', '12'],
                      'value': [100, 200, 300]})
    a = hc.createDataFrame(A)
    B = pd.DataFrame({'year': ['1993', '1993'],
                      'month': ['12', '12'],
                      'value': [101, 102]})
    b = hc.createDataFrame(B)
    print "Pandas"  # try with Pandas
    print A
    print B
    print pd.merge(A, B, on=['year', 'month'], how='inner')
    print "Spark"
    print a.toPandas()
    print b.toPandas()
    print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()

Output:

    Pandas
      month  value  year
    0     5    100  1993
    1    12    200  2005
    2    12    300  1994

      month  value  year
    0    12    101  1993
    1    12    102  1993

    Empty DataFrame
    Columns: [month, value_x, year, value_y]
    Index: []

    Spark
      month  value  year
    0     5    100  1993
    1    12    200  2005
    2    12    300  1994

      month  value  year
    0    12    101  1993
    1    12    102  1993

      month  value  year  month  value  year
    0    12    200  2005     12    102  1993
    1    12    200  2005     12    101  1993
    2    12    300  1994     12    102  1993
    3    12    300  1994     12    101  1993

It looks like Spark returns results where an inner join should return nothing. Am I doing the two-column join the wrong way? If yes, what is the right syntax for this?

Thanks!
Ali
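The bogus YMJN rows above are consistent with the condition YM1DF.yr==YM2DF.yr and YM1DF.mn==YM2DF.mn silently collapsing to just the mn comparison: Python's `and` keyword cannot be overloaded by column objects. A toy sketch (Expr is an illustrative stand-in, not PySpark's Column) shows the difference between `and` and the overloadable `&`:

```python
# Why `cond1 and cond2` silently drops a condition: Python's `and` evaluates
# truthiness and returns one of its operands unchanged -- it never builds a
# combined expression. The `&` operator (__and__) CAN be overloaded, which is
# why column-expression libraries use it for boolean combinations.
class Expr:
    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        # `&` lets us construct a real combined expression
        return Expr("({0} AND {1})".format(self.text, other.text))

yr_eq = Expr("yr1 = yr2")
mn_eq = Expr("mn1 = mn2")

# `and` sees yr_eq as truthy (default __bool__ is True) and returns mn_eq:
print((yr_eq and mn_eq).text)  # prints "mn1 = mn2" -- the year test vanished

# `&` builds the two-column condition we actually wanted:
print((yr_eq & mn_eq).text)    # prints "(yr1 = yr2 AND mn1 = mn2)"
```

If PySpark's Column supports `&` here (later documentation suggests it does), writing (YM1DF.yr == YM2DF.yr) & (YM1DF.mn == YM2DF.mn), with the parentheses, would build the intended two-column condition; worth verifying on 1.3 before relying on it.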
Question regarding join with multiple columns with pyspark
Hi experts,

Sorry if this is a n00b question or has already been answered... I am trying to use the data frames API in Python to join 2 dataframes on more than 1 column. The example I've seen in the documentation only shows a single column, so I tried this:

Example code:

    import pandas as pd
    from pyspark.sql import SQLContext

    hc = SQLContext(sc)
    A = pd.DataFrame({'year': ['1993', '2005', '1994'],
                      'month': ['5', '12', '12'],
                      'value': [100, 200, 300]})
    a = hc.createDataFrame(A)
    B = pd.DataFrame({'year': ['1993', '1993'],
                      'month': ['12', '12'],
                      'value': [101, 102]})
    b = hc.createDataFrame(B)
    print "Pandas"  # try with Pandas
    print A
    print B
    print pd.merge(A, B, on=['year', 'month'], how='inner')
    print "Spark"
    print a.toPandas()
    print b.toPandas()
    print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()

Output:

    Pandas
      month  value  year
    0     5    100  1993
    1    12    200  2005
    2    12    300  1994

      month  value  year
    0    12    101  1993
    1    12    102  1993

    Empty DataFrame
    Columns: [month, value_x, year, value_y]
    Index: []

    Spark
      month  value  year
    0     5    100  1993
    1    12    200  2005
    2    12    300  1994

      month  value  year
    0    12    101  1993
    1    12    102  1993

      month  value  year  month  value  year
    0    12    200  2005     12    102  1993
    1    12    200  2005     12    101  1993
    2    12    300  1994     12    102  1993
    3    12    300  1994     12    101  1993

It looks like Spark returns results where an inner join should return nothing. Am I doing the two-column join the wrong way? If yes, what is the right syntax for this?

Thanks!
Ali

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
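For reference, the result the two-column inner join should produce can be pinned down with a few lines of plain Python mirroring the data above (illustrative only):

```python
# A tiny reference implementation of a two-key inner join. Every row of B has
# (year, month) = ('1993', '12'), which matches no row of A, so a correct
# inner join on BOTH columns is empty -- exactly what the Pandas merge printed.
A = [('1993', '5', 100), ('2005', '12', 200), ('1994', '12', 300)]  # (year, month, value)
B = [('1993', '12', 101), ('1993', '12', 102)]

joined = [ra + rb for ra in A for rb in B
          if ra[0] == rb[0] and ra[1] == rb[1]]  # both keys must match
print(joined)  # prints "[]"
```

Any join that returns rows for this data, as the Spark output above does, is not matching on both columns.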